CYCLUS
hdf5_back.h
1 #ifndef CYCLUS_SRC_HDF5_BACK_H_
2 #define CYCLUS_SRC_HDF5_BACK_H_
3 
4 #include <map>
5 #include <set>
6 #include <string>
7 #include <sstream>
8 
9 #include "boost/filesystem.hpp"
10 
11 #include "hdf5.h"
12 #include "hdf5_hl.h"
13 #include "query_backend.h"
14 
15 namespace cyclus {
16 
17 /// A Recorder backend that writes data to an HDF5 file. Identically named
18 /// Datum objects have their data placed as rows in a single table.
19 ///
20 /// The HDF5 backend ensures that every column in its tables is represented
21 /// in the schema with a fixed size. This in turn ensures that the schema itself
22 /// is of a fixed size. This fixed size constraint applies even to variable length
23 /// (VL) data types (string, blob, vector, etc.).
24 ///
25 /// Variable length data is handled in a special way to ensure a fixed length
26 /// column. The naive approach would be to set a maximum size based on the data
27 /// available. However, this is not truly a fixed length data type. Instead, the
28 /// HDF5 backend serves as an on-disk bidirectional hash map for each VL data type.
29 ///
30 /// A regular hash table applies a hash function to keys and stores the values based
31 /// on this hash. Keys are unique and values may be repeated for many keys. In a
32 /// bidirectional hash map the keys and values are both one-to-one and onto. This
33 /// makes storing a separate hash redundant, since the key and the hash are the same.
34 ///
35 /// The HDF5 backend uses SHA1 digests as its keys for VL data. A SHA1 digest is
36 /// 20 bytes, 5x the size of a standard 32-bit unsigned int, which provides a
37 /// gigantic address space in which to store variable length data.
38 /// HDF5 provides two critical features that make an address space of this size
39 /// possible: multidimensional arrays and chunking.
40 ///
41 /// HDF5 easily supports 5D arrays. This allows us to use the SHA1 hash not only as
42 /// a key, but also to cast it to an index (len-5 array of unsigned ints) for
43 /// a 5D array. Furthermore, for such an array we can set the chunksize to a single
44 /// element ([1, 1, 1, 1, 1]). This allows us to have the full space available
45 /// ([UINT_MAX, UINT_MAX, UINT_MAX, UINT_MAX, UINT_MAX]) while only storing the
46 /// data that actually exists!
47 ///
48 /// In the table columns for VL data, the HDF5 backend stores the SHA1 as a length-5
49 /// array of unsigned ints. Looking up the associated value is simply a matter of using
50 /// this array as an index into a special data type array as above.
51 /// This has the added advantage of de-duplicating storage for identical entries.
52 ///
53 /// On disk, the keys and values for a data type are stored as arrays named with the
54 /// base data type followed by the string "Keys" or "Vals", respectively. For
55 /// instance, BLOB is stored in the arrays BlobKeys and BlobVals while VL_VECTOR_INT
56 /// is stored in the arrays VectorIntKeys and VectorIntVals.
57 ///
58 /// In memory, all active keys are stored in the vlkeys_ private member of this class.
59 /// This maps the DbType to a set of the SHA1 digests. It prevents values that
60 /// already exist on disk from being written again.
61 ///
62 /// The cost of the bidirectional hash map strategy is that the values need to be
63 /// looked up in a separate read() from that of the table itself. However, users
64 /// of VL data types should expect a performance hit, and this is one of
65 /// the more efficient strategies.
66 ///
67 /// Another implicit problem with all hash mappings is the possibility of collision.
68 /// In practice, however, a collision is vanishingly unlikely here. For SHA1, there is
69 /// a 3.4e-13 chance of having a single collision with 1e18 (a billion billion) entries.
70 ///
71 /// Still, if the address space of SHA1 ever becomes insufficient for some reason,
72 /// please move to a larger digest such as SHA224, SHA256, or higher. Such a
73 /// migration is not anticipated but would be straightforward.
74 class Hdf5Back : public FullBackend {
75  public:
76  /// Creates a new backend writing data to the specified file.
77  ///
78  /// @param path the file to write to. If it exists, it will be overwritten.
79  Hdf5Back(std::string path);
80 
81  /// Cleans up resources and closes the file.
82  virtual ~Hdf5Back();
83 
84  /// Closes and flushes the backend.
85  virtual void Close();
86 
87  virtual void Notify(DatumList data);
88 
89  virtual std::string Name();
90 
91  virtual inline void Flush() { H5Fflush(file_, H5F_SCOPE_GLOBAL); }
92 
93  virtual QueryResult Query(std::string table, std::vector<Cond>* conds);
94 
95  virtual std::map<std::string, DbTypes> ColumnTypes(std::string table);
96 
97  virtual std::list<ColumnInfo> Schema(std::string table);
98 
99  virtual std::set<std::string> Tables();
100 
101  private:
102  /// Creates a QueryResult from a table description.
103  QueryResult GetTableInfo(std::string title, hid_t dset, hid_t dt);
104 
105  /// Reads a table's column types into schemas_ if they aren't already there
106  /// \{
107  void LoadTableTypes(std::string title, hsize_t ncols, Datum *d);
108  void LoadTableTypes(std::string title, hid_t dset, hsize_t ncols);
109  /// \}
110 
111  /// Creates a fixed length HDF5 string type of length-n
112  hid_t CreateFLStrType(int n);
113 
114  /// Creates and initializes an hdf5 table with schema defined by d.
115  void CreateTable(Datum* d);
116 
117  /// Writes a group of Datum objects with the same title to their
118  /// corresponding hdf5 dataset.
119  void WriteGroup(DatumList& group);
120 
121  /// Fills a contiguous memory buffer with data from group for writing to an
122  /// hdf5 dataset.
123  void FillBuf(std::string title, char* buf, DatumList& group, size_t* sizes,
124  size_t rowsize);
125 
126  /// Read variable length data from the database.
127  /// @param rawkey the SHA1 digest key as a byte array.
128  /// @return the value indicated by this type at this location.
129  template <typename T, DbTypes U>
130  T VLRead(const char* rawkey);
131 
132  /// Writes variable length data to its on-disk bidirectional hash map.
133  /// @param x the data to write.
134  /// @tparam U the database type of x.
135  /// @return the key of x, which is a SHA1 hash as a len-5 array of ints.
136  /// \{
137  template <typename T, DbTypes U>
138  Digest VLWrite(const T& x);
139 
140  template <typename T, DbTypes U>
141  inline Digest VLWrite(const boost::spirit::hold_any* x) {
142  return VLWrite<T, U>(x->cast<T>());
143  }
144  /// \}
145 
146  template <DbTypes U>
147  void WriteToBuf(char* buf, std::vector<int>& shape, const boost::spirit::hold_any* a, size_t column);
148 
149  /// Gets an HDF5 reference dataset for a variable length datatype
150  /// If the dataset does not exist in the database, it will create it.
151  ///
152  /// @param dbtype the datatype to retrieve
153  /// @param forkeys specifies whether to retrieve the keys (true) or
154  /// values (false) dataset, optional
155  /// @return the dataset identifier
156  hid_t VLDataset(DbTypes dbtype, bool forkeys);
157 
158  /// Appends a key to a variable length key dataset
159  ///
160  /// @param dset an open HDF5 dataset
161  /// @param dbtype the variable length data type
162  /// @param key the SHA1 digest to append
163  void AppendVLKey(hid_t dset, DbTypes dbtype, const Digest& key);
164 
165 
166  /// Inserts variable length data into its value dataset
167  ///
168  /// @param dset an open HDF5 dataset
169  /// @param dbtype the variable length data type
170  /// @param key the SHA1 digest to append
171  /// @param buf the value or buffer to insert
172  /// \{
173  void InsertVLVal(hid_t dset, DbTypes dbtype, const Digest& key,
174  const std::string& val);
175  void InsertVLVal(hid_t dset, DbTypes dbtype, const Digest& key,
176  hvl_t buf);
177  /// \}
178 
179  /// Converts a value to a variable length buffer for HDF5.
180  /// \{
181  hvl_t VLValToBuf(const std::vector<int>& x);
182  hvl_t VLValToBuf(const std::vector<float>& x);
183  hvl_t VLValToBuf(const std::vector<double>& x);
184  hvl_t VLValToBuf(const std::vector<std::string>& x);
185  hvl_t VLValToBuf(const std::vector<cyclus::Blob>& x);
186  hvl_t VLValToBuf(const std::vector<boost::uuids::uuid>& x);
187  hvl_t VLValToBuf(const std::set<int>& x);
188  hvl_t VLValToBuf(const std::set<float>& x);
189  hvl_t VLValToBuf(const std::set<double>& x);
190  hvl_t VLValToBuf(const std::set<std::string>& x);
191  hvl_t VLValToBuf(const std::set<cyclus::Blob>& x);
192  hvl_t VLValToBuf(const std::set<boost::uuids::uuid>& x);
193  hvl_t VLValToBuf(const std::list<bool>& x);
194  hvl_t VLValToBuf(const std::list<int>& x);
195  hvl_t VLValToBuf(const std::list<float>& x);
196  hvl_t VLValToBuf(const std::list<double>& x);
197  hvl_t VLValToBuf(const std::list<std::string>& x);
198  hvl_t VLValToBuf(const std::list<cyclus::Blob>& x);
199  hvl_t VLValToBuf(const std::list<boost::uuids::uuid>& x);
200  hvl_t VLValToBuf(const std::map<int, bool>& x);
201  hvl_t VLValToBuf(const std::map<int, int>& x);
202  hvl_t VLValToBuf(const std::map<int, float>& x);
203  hvl_t VLValToBuf(const std::map<int, double>& x);
204  hvl_t VLValToBuf(const std::map<int, std::string>& x);
205  hvl_t VLValToBuf(const std::map<int, cyclus::Blob>& x);
206  hvl_t VLValToBuf(const std::map<int, boost::uuids::uuid>& x);
207  hvl_t VLValToBuf(const std::map<std::string, bool>& x);
208  hvl_t VLValToBuf(const std::map<std::string, int>& x);
209  hvl_t VLValToBuf(const std::map<std::string, float>& x);
210  hvl_t VLValToBuf(const std::map<std::string, double>& x);
211  hvl_t VLValToBuf(const std::map<std::string, std::string>& x);
212  hvl_t VLValToBuf(const std::map<std::string, cyclus::Blob>& x);
213  hvl_t VLValToBuf(const std::map<std::string, boost::uuids::uuid>& x);
214  hvl_t VLValToBuf(const std::map<std::pair<int, std::string>, double>& x);
215  hvl_t VLValToBuf(const std::map<std::string, std::vector<double>>& x);
216  hvl_t VLValToBuf(const std::map<std::string, std::map<int, double>>& x);
217  hvl_t VLValToBuf(const std::map<std::string, std::pair<double, std::map<int, double>>>& x);
218  hvl_t VLValToBuf(const std::map<int, std::map<std::string, double>>& x);
219  hvl_t VLValToBuf(const std::map<std::string, std::vector<std::pair<int, std::pair<std::string, std::string>>>>& x);
220  hvl_t VLValToBuf(const std::list<std::pair<int, int>>& x);
221  hvl_t VLValToBuf(const std::map<std::string, std::pair<std::string, std::vector<double>>>& x);
222  hvl_t VLValToBuf(const std::map<std::string, std::map<std::string, int>>& x);
223  hvl_t VLValToBuf(const std::vector<std::pair<std::pair<double, double>, std::map<std::string, double>>>& x);
224  hvl_t VLValToBuf(const std::vector<std::pair<int, std::pair<std::string, std::string>>>& x);
225  hvl_t VLValToBuf(const std::map<std::pair<std::string, std::string>, int>& x);
226  hvl_t VLValToBuf(const std::map<std::string, std::map<std::string, double>>& x);
227 
228 
229  /// \}
230 
231  /// Converts a variable length buffer to a value for HDF5.
232  /// \{
233  template <typename T>
234  T VLBufToVal(const hvl_t& buf);
235  /// \}
236 
237  /// Flag for whether the backend is closed or not.
238  bool closed_ = false;
239 
240  /// A class to help with hashing variable length datatypes
241  Sha1 hasher_;
242 
243  /// A reference to a database.
244  hid_t file_;
245  /// The HDF5 UUID type, 16 byte char string.
246  hid_t uuid_type_;
247  /// The HDF5 SHA1 type, len-5 int array.
248  hid_t sha1_type_;
249  /// The HDF5 variable length string type.
250  hid_t vlstr_type_;
251  /// The HDF5 Blob type, variable length string.
252  hid_t blob_type_;
253 
254  /// Variable length value chunk size and extent
255  static const hsize_t vlchunk_[CYCLUS_SHA1_NINT];
256 
257  /// Listing of types opened here so that we may close them.
258  std::set<hid_t> opened_types_;
259 
260  /// Stores the database's path+name, declared during construction.
261  std::string path_;
262 
263  /// Offsets in bytes of each column in the tables. Note that Hdf5Back itself
264  /// owns the value pointers and deallocates them in the destructor.
265  std::map<std::string, size_t*> col_offsets_;
266 
267  /// Size in bytes of each column in the tables. Note that Hdf5Back itself
268  /// owns the value pointers and deallocates them in the destructor.
269  std::map<std::string, size_t*> col_sizes_;
270 
271  /// Total size in bytes of the whole schema in the tables.
272  std::map<std::string, size_t> schema_sizes_;
273 
274  /// Backend database specific datatypes for each column in the tables.
275  /// Note that Hdf5Back itself owns the value pointers and deallocates them
276  /// in the destructor.
277  std::map<std::string, DbTypes*> schemas_;
278 
279  /// Map of array name (e.g. StringVals, BlobVals) to the HDF5 id for the
280  /// corresponding dataset for variable length data.
281  std::map<std::string, hid_t> vldatasets_;
282 
283  /// Map of database type to the corresponding HDF5 datatype.
284  std::map<DbTypes, hid_t> vldts_;
285 
286  /// Map of database type to the set of current keys present in the database.
287  std::map<DbTypes, std::set<Digest> > vlkeys_;
288 };
289 
290 const hsize_t Hdf5Back::vlchunk_[CYCLUS_SHA1_NINT] = {1, 1, 1, 1, 1};
291 
292 } // namespace cyclus
293 
294 #endif // CYCLUS_SRC_HDF5_BACK_H_
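The class comment describes using a SHA1 digest both as a key and as a five-element index into a sparsely stored 5D array, with an in-memory key set preventing duplicate writes. The sketch below models that idea in plain C++ without touching HDF5. Every name in it (Sha1Index, ToIndex, SparseVLStore, DedupExample) is hypothetical and not part of Cyclus, and a std::map stands in for the chunked on-disk dataset, so treat it as an illustration of the addressing scheme rather than the backend's implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <string>

// Hypothetical stand-in for the digest type: a 160-bit SHA1 value viewed as
// five 32-bit unsigned ints, usable directly as a 5D array index.
using Sha1Index = std::array<std::uint32_t, 5>;

// Reinterprets 20 raw digest bytes as a 5D index. memcpy avoids the alignment
// and aliasing issues of a pointer cast. The resulting integers depend on the
// host byte order, which is fine as long as reads and writes agree.
inline Sha1Index ToIndex(const unsigned char (&raw)[20]) {
  Sha1Index idx;
  std::memcpy(idx.data(), raw, sizeof(raw));
  return idx;
}

// In-memory model of one VL type's Keys/Vals pair. Only occupied cells take
// up storage, which mimics a 5D dataset chunked as [1, 1, 1, 1, 1].
class SparseVLStore {
 public:
  // Stores val at idx unless that key is already present.
  // Returns true only when a new value was actually written.
  bool Insert(const Sha1Index& idx, const std::string& val) {
    return cells_.emplace(idx, val).second;
  }

  // Returns the value stored at idx, or nullptr if the cell is empty.
  const std::string* Find(const Sha1Index& idx) const {
    auto it = cells_.find(idx);
    return it == cells_.end() ? nullptr : &it->second;
  }

  std::size_t size() const { return cells_.size(); }

 private:
  std::map<Sha1Index, std::string> cells_;
};

// Usage sketch: identical values share one key, so the second insert is a
// no-op and storage is de-duplicated.
inline void DedupExample() {
  const unsigned char raw[20] = {0xde, 0xad, 0xbe, 0xef};  // rest zero-filled
  SparseVLStore store;
  const Sha1Index idx = ToIndex(raw);
  assert(store.Insert(idx, "payload"));
  assert(!store.Insert(idx, "payload"));
  assert(store.size() == 1);
}
```

A table row would then hold only the fixed-size Sha1Index, and reading the value back is a second lookup keyed by that index, which is the extra read() cost the class comment mentions.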
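The collision estimate quoted in the Hdf5Back class comment (about 3.4e-13 for 1e18 entries) is consistent with the standard birthday-bound approximation p ≈ n^2 / (2 * 2^bits), which holds when n is far below 2^(bits/2). The helper below is a hypothetical illustration for checking that arithmetic and is not part of Cyclus.

```cpp
#include <cmath>

// Birthday-bound approximation of the probability of at least one collision
// among n uniformly random digests of the given bit width.
// Valid only when n is much smaller than 2^(bits / 2).
inline double CollisionProbability(double n, int bits) {
  // std::ldexp(1.0, bits) computes 2^bits exactly in floating point.
  return n * n / (2.0 * std::ldexp(1.0, bits));
}
```

Calling CollisionProbability(1e18, 160) gives roughly 3.4e-13 for SHA1, and the same call with 256 bits shows how much headroom a move to SHA256 would add.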