word wrap.. too annoying to read when editing..

4 years ago · 8658b67d57
--- a/README.md
+++ b/README.md
@@ -1,38 +1,87 @@
 # medashare (Meta Data Sharing)

 medashare is both a standard for, and an implementation of, defining
 and sharing meta data about files.  Some media formats can embed meta
 data (e.g. mp3, some documents, video files), but not all files are
 able to convey the complete and rich information that may be desired.
 medashare allows you to identify files and to associate external meta
 data with them.

 The idea is also to be able to classify parts of the meta data as
 private (aka TLP:Red), such that the information will not be shared, so
 that you can mark files w/ your own tags and/or ratings, which will not
 be automatically shared.

 The standard will also be able to define derivative works.  For
 example, an album may have multiple tracks, so you can note that an
 mp3 file is a segment of the album, or that the mp3 file obtained from
 a music service is equivalent to the track.  This will allow sharing
 meta data between mediums.

 This derivation is also useful when files are programmatically
 transformed.  Say an image is resized: adding a note that the smaller
 file is a derivative work allows others to reproduce the file, and
 also saves you from reentering all the meta data associated with the
 new version of the file.

 This can be useful for things like raw files on a camera, where you
 associate general picture information w/ the raw image data (but none
 of the associated processing data that files like CR2 contain) so that
 the meta data is not lost.

 This work is inspired by my work on STIX, a Cyber Threat Intelligence
 standard, which has many requirements similar to those of meta data
 sharing.

 ## Goals / Use Cases

 1. Provide meta data, such as title, actors, copyright holder, for
   files, such as movies, photos, documents.  Meta data means
   resolution information, page count, tags, and more.
 2. Allow look up of meta data by title, actors, etc.
 3. Look up meta data belonging to a file, via file hash.
 4. Support embedded files, such as within a zip file, or BitTorrent, so
   the querier can get all the meta data for a container file, or so
   that the file can be located for download.  For example, the info
   hash for FreeBSD 11.2-R, which can then be downloaded.
 5. Identify transformations of files, such as a reencoding of a movie,
   resizing of a photo, or clips of audio/video.  For example, a CD
   often has tracks, and there may be a file that is the whole CD, or
   just one or part of a track.  Both directions should be supported:
   noting that a track is part of an album, and that an album has
   tracks, with links to them.
 6. Possibly use fingerprint technology, so that the database can be
   queried for meta data based upon parts of the audio/video/image.
 7. Provide meta data for other objects too, such as suggested page down
   locations for PDFs, or what parts of a PDF should be kept on screen
   as a complete set, so that a page down keeps things readable (and
   you don't have to arrow up).
 8. Links to other repositories, such as YouTube videos, SoundCloud, etc.
 9. i18n.  Provide translations for fields as needed.  Often movie titles
   will have different translations for different markets/languages.
   Actors may have different names (e.g. Chinese name vs English name).
 10. Overlaying/replacing meta data from someone else's object.  This may
    include deleting properties.   Say an actor is missing, or you want
    to add them to it, or you've encoded the DVD, and you just link to
    someone's BluRay version.
 11. Provide links to parts of documents.  That is, a particular
    section/page number of a document for further reference.  In the
    latter case, this likely should use a common canonical format so
    that all references to page X of document Y result in the same
    hash.  Or should this just be part of the url/urn format?  (likely)

 ## URN

 Each object has a URN which uniquely describes it.  XXX copy from STIX
 URN proposal, which is similar to the magnet proposal.

 ## Types

 Everything must have a type.  Not having well defined types can lead to
 confusion and problems.  Different encoding schemes have different ways
 of encoding types.  If the encoding scheme has a native way to encode
 a type, it should be used.  In some cases, e.g. JSON, there are no
 formal types beyond numbers and strings, and in this case, a type
 should (MUST? or via schemas?) be layered on top.

 ### Integers

@@ -40,7 +89,8 @@ Look at adding units.

 ### Hash String

 The hash string is the name of the hash (hash type) followed by a colon
 followed by the hex string (hash value).

 The list of valid hashes is:
 - sha256
@@ -48,11 +98,18 @@ The list of valid hashes is:

 ### Reference

 A reference is the UUID optionally followed by two dashes (--) followed
 by the modified date of the object.  The modified date is necessary in
 some cases to know what version of the object is being referenced.
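
A minimal sketch of parsing such a reference, in Python.  The text does
not fix the date format, so this assumes an ISO 8601 timestamp that
`datetime.fromisoformat` accepts; the names `Reference` and
`parse_reference` are hypothetical.

```python
import uuid
from datetime import datetime
from typing import NamedTuple, Optional


class Reference(NamedTuple):
    ident: uuid.UUID               # the UUID of the referenced object
    modified: Optional[datetime]   # None when no version is pinned


def parse_reference(ref: str) -> Reference:
    """Split '<uuid>[--<modified date>]' into its two parts."""
    # A UUID only contains single dashes, so the first '--' (if any)
    # is the separator between the UUID and the modified date.
    ident, sep, stamp = ref.partition("--")
    # Assumption: the modified date is an ISO 8601 timestamp.
    modified = datetime.fromisoformat(stamp) if sep else None
    return Reference(uuid.UUID(ident), modified)
```

A reference without the `--` suffix parses with `modified` set to
`None`, meaning "whatever version is current".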

 ### Hashes

 For storing multiple hashes, a hashes type is used.  This will be
 serialized as a list of hash strings.  The list is used to reduce
 serialization overhead, and when loaded, may be parsed out into a
 dictionary for faster lookup.  Only one hash string of each hash type
 is allowed.  If there are duplicates, the object is invalid and MUST be
 ignored.
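
The loading rule above could be sketched as follows.  This is only an
illustration: the helper names are hypothetical, `sha256` is the only
hash type visible in this section, and returning `None` for an invalid
object is an assumption about how "ignored" might be signalled.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Assumption: only sha256 is listed here; extend as the valid list grows.
VALID_HASH_TYPES = {"sha256"}
_HEX = re.compile(r"[0-9a-fA-F]+")


def parse_hash_string(value: str) -> Tuple[str, str]:
    """Split '<hash type>:<hex value>' and validate both halves."""
    hash_type, sep, digest = value.partition(":")
    if not sep or hash_type not in VALID_HASH_TYPES or not _HEX.fullmatch(digest):
        raise ValueError("invalid hash string: %r" % (value,))
    return hash_type, digest.lower()


def load_hashes(hash_strings: Iterable[str]) -> Optional[Dict[str, str]]:
    """Turn the serialized list into a dict keyed by hash type.

    Returns None when a hash type appears twice, since such an object
    is invalid and MUST be ignored.
    """
    result: Dict[str, str] = {}
    for item in hash_strings:
        hash_type, digest = parse_hash_string(item)
        if hash_type in result:
            return None  # duplicate hash type: object is invalid
        result[hash_type] = digest
    return result
```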

 ## Objects

@@ -76,39 +133,75 @@ object_marking_refs	Imported from [STIX v2.0 Part 1]: Section 3.1
 granular_markings	Imported from [STIX v2.0 Part 1]: Section 3.1
 hashes		A list of hash strings.
 lang		RFC XXXX language of the properties.
 parent_refs	List of UUIDv4s of the MetaData Objects that this
 		object overlays.  Any properties on this object
 		override the parent. (allow deletion via None/null?)
 		Any missing properties are passed through to the
 		parent for resolution.  The first/earliest object in
 		the list that has a property is used; objects later
 		in the list are "hidden" by the earlier objects.
 mimetype	The mime-type.  If the set of bytes is polymorphic,
 		there should be one for each "type".
 uri		List of URIs where the file may be located.
 child_files	A dictionary where the keys are the file names and the
 		values are hash strings.  (One issue w/ using hashes is
 		that you can't tie YOUR idea of the metadata, but it
 		also allows a person to have metadata about a file that
 		is private and not be forced to share it, nor create a
 		dummy object.)

 Opinion Properties:
 qualityrating	On a scale from 1 (poor/terrible) to 5 (great/pristine),
 		the subjective quality of the content.

 The base object will contain all the data associated w/ the file
 (object).  The base set of data is based upon the [Dublin Core]
 specification, as it provides a nice starting point, and will provide a
 good mapping to other systems out there.

 There may be a link to another MetaData object from which this one is
 derived.  If there is, all the meta data from that parent object (and
 from the ones it derives from in turn) must be included, except for
 the properties that have been marked deleted, or were overridden.
 When a property is marked as opinion, it should not be inherited.  If
 the new author agrees with the opinion, then they have to restate the
 opinion in their object.
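
One way the inheritance rules could work is sketched below.  It assumes
the still-open idea of using None/null to mark a deleted property, and
it assumes the parents have already been fetched and resolved; the
`resolve` function and the `resolution`/`hdr` property names are
hypothetical.

```python
from typing import Any, Dict, List, Set


def resolve(obj: Dict[str, Any],
            parents: List[Dict[str, Any]],
            opinion_props: Set[str]) -> Dict[str, Any]:
    """Merge an object's properties over its parents' properties.

    `parents` is in parent_refs order; each entry is an already
    resolved property dictionary.
    """
    merged: Dict[str, Any] = {}
    # Walk the parents last to first so that earlier parents overwrite,
    # and therefore hide, later ones.
    for parent in reversed(parents):
        for key, value in parent.items():
            if key not in opinion_props:  # opinions are never inherited
                merged[key] = value
    # The object's own properties override everything inherited.
    for key, value in obj.items():
        if value is None:
            merged.pop(key, None)  # assumption: None marks a deletion
        else:
            merged[key] = value
    return merged


bluray = {"dc:title": "Example Movie", "resolution": "1920x1080",
          "hdr": "HDR10", "qualityrating": 5}
dvd = {"resolution": "720x480", "hdr": None}
print(resolve(dvd, [bluray], {"qualityrating"}))
```

Here the DVD object keeps the inherited title, overrides the
resolution, deletes `hdr`, and does not pick up the parent's
`qualityrating`.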

 Custom properties must be preceded w/ a namespace.  The namespace is a
 name followed by a colon, as is demonstrated above w/ dc for [Dublin
 Core].

 The link to the meta data object must include the version referenced,
 as the referenced object may change.  A three way merge may be needed
 when updating an object where the object it derives from has also
 been updated, if the new information is to be used.

 If a property is imported from the blob itself, it is recommended to
 mark it as such via the granular marking, see X for more info on how to
 do this.

 Open Questions:  When meta data is "declassified", how do you maintain
 a link to the classified version?

 ### File Object

 Properties:
 type		'file'
 uuid		UUIDv5  If the stats do not match, check hash, create a
 		derivative blob object, possibly?
 modified	date of last modification of the object
 stat		Stats for the file (modified time, file size), used to
 		detect when the file has been changed/modified.

 A file object references a blob Object, and contains information about
 the file name in the file system associated w/ the blob.  This is used
 to speed up looking up blob objects.

 ### Container Object

 A container object references one or more File objects.  This is for
 representing containers such as zip or tar.gz files, but is also for
 BitTorrent hashes (even for single file torrents).

 ### URL Object

@@ -116,7 +209,8 @@ Similar to the File Object, but for web resources.

 ## Links

 These are the edges that connect the nodes.  For the most part they do
 not contain any data.

 [//]: # (Do we for containers?  Shouldn't the File be unique, or if not, doesn't matter?)

@@ -128,16 +222,45 @@ The two linked nodes, required to be File Objects, are equivalent.

 # Open

 1. Fully embedded links or have a separate node object?  Embedded links
   have the advantage of being smaller, but require more structure in
   the parent object.  This structure is likely needed in some cases,
   such as albums w/ tracks, but some edges, such as a clip to a movie,
   still need to contain data (meaning not so much a node).  I'm
   leaning towards embedded for now, as this should make things easier,
   and often structure is needed.
 2. How to handle similar, but split meta data?  One person decides to
   make a simple meta data object for a scene from a movie, while
   another person makes a segment of that scene from the movie.  Should
   the segment object be a link between the two? or contain its own
   proper data?  Some of this can be handled w/ an equivalent meta data
   object to link two meta data objects as being the same.
 3. For quality, is this talking about the possible representation, or
   the actual "content"?  So, a VHS, or old analog over-the-air
   encoding may be crappy, but the movie content may be good.  We may
   want to do a multi layered approach (this is less than ideal due to
   complexities), where files can only link to info about that file,
   i.e. coding, format, resolution, and this meta data object links to
   one that is the actual content, i.e. movie w/ actors.  Or should
   this be done via overlay? i.e. someone creates a BluRay meta data
   object about a movie, and then the DVD overlays the DVD resolution
   and other info, deleting properties that are not relevant.

 Ask cvoid:
 Should a file system reference point to the blob hash or the uuidv4 of
 the blob object?  blob hash requires a lookup, and maybe selection?
 Maybe both?  To denote the selection.  Likely File Objects are going to
 be private, so internal optimization?  This will likely be different
 for URL objects as they are more public, where file system is often
 local only (unless on a shared, e.g. work, system).

 # Settled / Likely Closed

 1. Does a track of a CD deserve its own "meta data" object?  Thinking
   yes, as the track may be played on radio, etc.  And the Album object
   can point to the tracks.  This also helps solve the compilation
   problem as the artists and other details are easier to represent
   separately.

 # Notes

@@ -149,7 +272,11 @@ The [Dublin Core] standard is also noted by [RFC5013].

 # Some thoughts

 DHT looks like a good option for finding things.  IPFS is a great
 option for storing the data, and allowing peers to find the data, but
 it does NOT provide a search solution.  It should be possible to
 combine the hash tree crypto solution along w/ the DHT to provide a way
 to build up an index for a piece of data.

 Need to look at DSHT
 Thoughts:
@@ -160,10 +287,13 @@ For search, you need two functions:

 How to do lookup:
 1. Generate a hash of the search term: searchhash = hash(term)
 2. Do a query of this hash to find if there is an object at this
   location, and this hash will reference an object that contains the
   results.

 How to add an object:
 1. Do a lookup and fetch the object that contains all the current
   objects.
 2. Update object w/ new object, and now publish this new object.
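
The lookup and add steps above can be sketched with an in-memory
dictionary standing in for the DHT.  Everything here is an assumption
made for illustration: `ToyIndex` is a hypothetical name, sha256 is
used as the hash, the result object is a JSON list of references, and
there is no validation or conflict handling.

```python
import hashlib
import json
from typing import Dict, List


class ToyIndex:
    """A dict standing in for the DHT: searchhash -> result object."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    @staticmethod
    def search_hash(term: str) -> str:
        # searchhash = hash(term); sha256 is an assumption.
        return hashlib.sha256(term.encode("utf-8")).hexdigest()

    def lookup(self, term: str) -> List[str]:
        """Query the hash location and return the results stored there."""
        raw = self._store.get(self.search_hash(term))
        return json.loads(raw) if raw is not None else []

    def add(self, term: str, object_ref: str) -> None:
        """Fetch the current result object, update it, and publish it."""
        results = self.lookup(term)
        if object_ref not in results:
            results.append(object_ref)
        self._store[self.search_hash(term)] = json.dumps(results)


index = ToyIndex()
index.add("freebsd", "sha256:" + "ab" * 32)
print(index.lookup("freebsd"))
```

In a real DHT the fetch and publish in `add` would not be atomic, so
two writers could overwrite each other's updates.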


@@ -175,21 +305,32 @@ Adding valid hashes that don't have the proper term in them.
 When adding hashes, limit number of unverified hashes per block iteration.

 The issue is how to accept that a new block is valid:
 * Some items are attempted to be fetched (likely based upon generation)
  and validated.  Ones that are not validated are marked as such, and
  after a period of time of remaining unvalidated are removed.
 * New objects likely need to be validated in the immediately following
  block to help prevent bad growth.
 * Likely there needs to be multiple "live" blocks that are
  intermingled.  This can be done via a simple LFSR + count, likely
  dependent upon number of updates and size and difficulty of
  validating objects.

 When updating, always check for n + 1 until n is not found.  When
 publishing, depending upon timeframe, select n where n is smallest but
 still meets the time parameter.


 ----

 Not sure if IPFS has a final hash mapping system (was unable to find
 one), but each object in IPFS may have a different multihash depending
 upon how the blob/list is broken out.

 Need a mapping system like:
 sha256:XXX -> ipfs:XLYkgq61DYaa8Nh3cq1U7rLinSa7dSHQ16x

 This could/should include multiple other items? like maybe block level
 hashing, though if there's an IPFS entry, that provides it.

 Reference Info: