CYCLUS
hdf5_back.h
1 #ifndef CYCLUS_SRC_HDF5_BACK_H_
2 #define CYCLUS_SRC_HDF5_BACK_H_
3 
4 #include <map>
5 #include <set>
6 #include <string>
7 #include <sstream>
8 
9 #include "boost/filesystem.hpp"
10 
11 #include "hdf5.h"
12 #include "hdf5_hl.h"
13 #include "query_backend.h"
14 
15 namespace cyclus {
16 
17 /// A Recorder backend that writes data to an hdf5 file. Identically named
18 /// Datum objects have their data placed as rows in a single table.
19 ///
20 /// The HDF5 backend ensures that every column in its tables is represented
21 /// in the schema with a fixed size. This in turn ensures that the schema itself
22 /// is of a fixed size. This fixed size constraint applies even to variable length
23 /// (VL) data types (string, blob, vector, etc).
24 ///
25 /// Variable length data is handled in a special way to ensure a fixed length
26 /// column. The naive approach would be to set a maximum size based on the data
27 /// available. However, this is not truly a fixed length data type. Instead, the
28 /// HDF5 backend serves as an on-disk bidirectional hash map for each VL data type.
29 ///
30 /// A regular hash table applies a hash function to keys and stores the values based
31 /// on this hash. Keys are unique and values may be repeated for many keys. In a
32 /// bidirectional hash map the keys and values are both one-to-one and onto. This
33 /// makes storing a separate hash redundant, since the key and hash are the same.
34 ///
35 /// The HDF5 backend uses SHA1 digests as its keys for VL data. A SHA1 digest is
36 /// 20 bytes, 5x the size of a standard 32-bit unsigned int, which
37 /// provides a gigantic address space in which to store variable length data.
38 /// HDF5 provides two critical features that make an address space of this size
39 /// possible: multidimensional arrays and chunking.
40 ///
41 /// HDF5 easily supports 5D arrays. This allows us to use the SHA1 hash not only as
42 /// a key, but we can also cast it to an index (len-5 array of unsigned ints) for
43 /// a 5D array. Furthermore, for such an array we can set the chunksize to a single
44 /// element ([1, 1, 1, 1, 1]). This allows us to have the full space available
45 /// ([UINT_MAX, UINT_MAX, UINT_MAX, UINT_MAX, UINT_MAX]) while only storing the
46 /// data that actually exists!
47 ///
48 /// In the table columns for VL data, the HDF5 backend stores the SHA1 as a length-5
49 /// array of unsigned ints. Looking up the associated value is simply a matter of using
50 /// this array as an index into a special data type array as above.
51 /// This has the added advantage of de-duplicating storage for identical entries.
52 ///
53 /// On disk the keys and values for a data type are stored as arrays named with
54 /// the base data type and the strings "Keys" and "Vals" appended, respectively. For
55 /// instance, BLOB is stored in the arrays BlobKeys and BlobVals while VL_VECTOR_INT
56 /// is stored in the arrays VectorIntKeys and VectorIntVals.
57 ///
58 /// In memory, all active keys are stored in the vlkeys_ private member of this class.
59 /// This maps the DbType to a set of the SHA1 digests. This is used to prevent
60 /// excessive writing of values to disk that already exist.
61 ///
62 /// The cost of the bidirectional hash map strategy is that the values need to be
63 /// looked up in a separate read() from that of the table itself. However, by
64 /// using VL data types users should expect a performance hit and this is one of
65 /// the more efficient strategies.
66 ///
67 /// Another implicit problem with all hash mappings is the possibility of collision.
68 /// However, this is in practice impossible here. For SHA1, there is a 3.4e-13 chance
69 /// of having a single collision with 1e18 (a billion billion) entries.
70 ///
71 /// Still, if the address space of SHA1 ever becomes insufficient for some reason,
72 /// please move to a larger SHA value such as SHA224 or SHA256 or higher. Such a
73 /// migration is not anticipated but would be straightforward.
74 class Hdf5Back : public FullBackend {
75  public:
76  /// Creates a new backend writing data to the specified file.
77  ///
78  /// @param path the file to write to. If it exists, it will be overwritten.
79  Hdf5Back(std::string path);
80 
81  /// Cleans up resources and closes the file.
82  virtual ~Hdf5Back();
83 
84  /// Closes and flushes the backend.
85  virtual void Close();
86 
87  virtual void Notify(DatumList data);
88 
89  virtual std::string Name();
90 
91  virtual inline void Flush() { H5Fflush(file_, H5F_SCOPE_GLOBAL); }
92 
93  virtual QueryResult Query(std::string table, std::vector<Cond>* conds);
94 
95  virtual std::map<std::string, DbTypes> ColumnTypes(std::string table);
96 
97  virtual std::set<std::string> Tables();
98 
99  private:
100  /// Creates a QueryResult from a table description.
101  QueryResult GetTableInfo(std::string title, hid_t dset, hid_t dt);
102 
103  /// Reads a table's column types into schemas_ if they aren't already there
104  /// \{
105  void LoadTableTypes(std::string title, hsize_t ncols);
106  void LoadTableTypes(std::string title, hid_t dset, hsize_t ncols);
107  /// \}
108 
109  /// Creates a fixed length HDF5 string type of length n.
110  hid_t CreateFLStrType(int n);
111 
112  /// Creates and initializes an hdf5 table with schema defined by d.
113  void CreateTable(Datum* d);
114 
115  /// Writes a group of Datum objects with the same title to their
116  /// corresponding hdf5 dataset.
117  void WriteGroup(DatumList& group);
118 
119  /// Fill a contiguous memory buffer with data from group for writing to an
120  /// hdf5 dataset.
121  void FillBuf(std::string title, char* buf, DatumList& group, size_t* sizes,
122  size_t rowsize);
123 
124  /// Read variable length data from the database.
125  /// @param rawkey the SHA1 digest key as a byte array.
126  /// @return the value indicated by this type at this location.
127  template <typename T, DbTypes U>
128  T VLRead(const char* rawkey);
129 
130  /// Writes variable length data to its on-disk bidirectional hash map.
131  /// @param x the data to write.
132  /// @tparam U the database type of x.
133  /// @return the key of x, which is a SHA1 hash as a len-5 array of ints.
134  /// \{
135  template <typename T, DbTypes U>
136  Digest VLWrite(const T& x);
137 
138  template <typename T, DbTypes U>
139  inline Digest VLWrite(const boost::spirit::hold_any* x) {
140  return VLWrite<T, U>(x->cast<T>());
141  }
142  /// \}
143 
144  /// Gets an HDF5 reference dataset for a variable length datatype
145  /// If the dataset does not exist in the database, it will create it.
146  ///
147  /// @param dbtype the datatype to retrieve
148  /// @param forkeys specifies whether to retrieve the keys (true) or
149  /// values (false) dataset, optional
150  /// @return the dataset identifier
151  hid_t VLDataset(DbTypes dbtype, bool forkeys);
152 
153  /// Appends a key to a variable length key dataset
154  ///
155  /// @param dset an open HDF5 dataset
156  /// @param dbtype the variable length data type
157  /// @param key the SHA1 digest to append
158  void AppendVLKey(hid_t dset, DbTypes dbtype, const Digest& key);
159 
160 
161  /// Inserts variable length data into its value dataset
162  ///
163  /// @param dset an open HDF5 dataset
164  /// @param dbtype the variable length data type
165  /// @param key the SHA1 digest to append
166  /// @param buf the value or buffer to insert
167  /// \{
168  void InsertVLVal(hid_t dset, DbTypes dbtype, const Digest& key,
169  const std::string& val);
170  void InsertVLVal(hid_t dset, DbTypes dbtype, const Digest& key,
171  hvl_t buf);
172  /// \}
173 
174  /// Converts a value to a variable length buffer for HDF5.
175  /// \{
176  hvl_t VLValToBuf(const std::vector<int>& x);
177  hvl_t VLValToBuf(const std::vector<float>& x);
178  hvl_t VLValToBuf(const std::vector<double>& x);
179  hvl_t VLValToBuf(const std::vector<std::string>& x);
180  hvl_t VLValToBuf(const std::set<int>& x);
181  hvl_t VLValToBuf(const std::set<std::string>& x);
182  hvl_t VLValToBuf(const std::list<int>& x);
183  hvl_t VLValToBuf(const std::list<std::string>& x);
184  hvl_t VLValToBuf(const std::map<int, int>& x);
185  hvl_t VLValToBuf(const std::map<int, double>& x);
186  hvl_t VLValToBuf(const std::map<int, std::string>& x);
187  hvl_t VLValToBuf(const std::map<std::string, int>& x);
188  hvl_t VLValToBuf(const std::map<std::string, double>& x);
189  hvl_t VLValToBuf(const std::map<std::string, std::string>& x);
190  hvl_t VLValToBuf(const std::map<std::pair<int, std::string>, double>& x);
191  /// \}
192 
193  /// Converts a variable length buffer to a value for HDF5.
194  /// \{
195  template <typename T>
196  T VLBufToVal(const hvl_t& buf);
197  /// \}
198 
199  /// Flag for whether the backend is closed or not.
200  bool closed_ = false;
201 
202  /// A class to help with hashing variable length datatypes
203  Sha1 hasher_;
204 
205  /// A reference to a database.
206  hid_t file_;
207  /// The HDF5 UUID type, 16 byte char string.
208  hid_t uuid_type_;
209  /// The HDF5 SHA1 type, len-5 int array.
210  hid_t sha1_type_;
211  /// The HDF5 variable length string type.
212  hid_t vlstr_type_;
213  /// The HDF5 Blob type, variable length string.
214  hid_t blob_type_;
215 
216  /// Variable length value chunk size and extent
217  static const hsize_t vlchunk_[CYCLUS_SHA1_NINT];
218 
219  /// Listing of types opened here so that we may close them.
220  std::set<hid_t> opened_types_;
221 
222  /// Stores the database's path+name, declared during construction.
223  std::string path_;
224 
225  /// Offsets in bytes of each column in the tables, note that Hdf5Back itself
226  /// owns the value pointers and deallocates them in the destructor.
227  std::map<std::string, size_t*> col_offsets_;
228 
229  /// Size in bytes of each column in the tables, note that Hdf5Back itself
230  /// owns the value pointers and deallocates them in the destructor.
231  std::map<std::string, size_t*> col_sizes_;
232 
233  /// Total size in bytes of the whole schema in the tables.
234  std::map<std::string, size_t> schema_sizes_;
235 
236  /// Backend database specific datatypes for each column in the tables.
237  /// Note that Hdf5Back itself owns the value pointers and deallocates them
238  /// in the destructor.
239  std::map<std::string, DbTypes*> schemas_;
240 
241  /// Map of array name (eg StringVals, BlobVals) to the HDF5 id for the
242  /// corresponding dataset for variable length data.
243  std::map<std::string, hid_t> vldatasets_;
244 
245  /// Map of database type to the corresponding HDF5 datatype.
246  std::map<DbTypes, hid_t> vldts_;
247 
248  /// Map of database type to the set of current keys present in the database.
249  std::map<DbTypes, std::set<Digest> > vlkeys_;
250 };
251 
252 const hsize_t Hdf5Back::vlchunk_[CYCLUS_SHA1_NINT] = {1, 1, 1, 1, 1};
253 
254 } // namespace cyclus
255 
256 #endif // CYCLUS_SRC_HDF5_BACK_H_
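
The class documentation above describes three mechanics: reading a 20-byte SHA1 digest as a length-5 index of unsigned ints, naming the on-disk arrays with "Keys" and "Vals" suffixes, and keeping an in-memory set of keys so that existing values are not rewritten. The following is a minimal sketch of those ideas only, not the Cyclus implementation. `DigestWords`, `DigestToIndex`, `KeysName`, `ValsName`, and `KeyCache` are hypothetical names introduced for illustration, and the cache is keyed by a type-name string rather than by `DbTypes` to keep the sketch self-contained.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for a SHA1 digest viewed as five 32-bit words
// (the real Digest type lives in query_backend.h and may differ).
using DigestWords = std::array<std::uint32_t, 5>;

// Copy a raw 20-byte digest into five unsigned 32-bit words, one per
// dimension of a 5D array. The byte order within each word is the host's.
DigestWords DigestToIndex(const unsigned char* raw) {
  DigestWords idx;
  std::memcpy(idx.data(), raw, sizeof(std::uint32_t) * idx.size());
  return idx;
}

// Build on-disk array names from a base type name, e.g. "Blob" -> "BlobKeys".
std::string KeysName(const std::string& base) { return base + "Keys"; }
std::string ValsName(const std::string& base) { return base + "Vals"; }

// In-memory record of keys already seen for each type name.
class KeyCache {
 public:
  // Returns true if the key was not seen before, meaning the value would
  // still need to be written; returns false for a duplicate.
  bool Insert(const std::string& base, const DigestWords& key) {
    return seen_[base].insert(key).second;
  }

 private:
  std::map<std::string, std::set<DigestWords>> seen_;
};
```

Under these assumptions, a writer would call `KeyCache::Insert` before touching the values array and skip the write when it returns false, which is the de-duplication effect the documentation attributes to `vlkeys_`.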
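
The collision figure quoted in the class documentation (about 3.4e-13 for 1e18 entries) can be checked with the standard birthday-bound approximation, p ≈ n² / (2 · 2^bits), which holds when p is much smaller than 1. The helper below is a sketch for that arithmetic only; `CollisionProbability` is a hypothetical name and is not part of Cyclus.

```cpp
#include <cmath>

// Birthday-bound approximation of the probability of at least one collision
// among n uniformly distributed hashes of the given bit width.
// Only meaningful when the result is much smaller than 1.
double CollisionProbability(double n, int bits) {
  // std::ldexp(1.0, bits) computes 2^bits exactly in floating point.
  return n * n / (2.0 * std::ldexp(1.0, bits));
}
```

Calling `CollisionProbability(1e18, 160)` gives roughly 3.4e-13 for the 160-bit SHA1 digest, in line with the documented estimate. Passing 224 or 256 for `bits` shows how much smaller the value becomes for the larger SHA variants mentioned as a migration path.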