CYCLUS
hdf5_back.h
1#ifndef CYCLUS_SRC_HDF5_BACK_H_
2#define CYCLUS_SRC_HDF5_BACK_H_
3
4#include <map>
5#include <set>
6#include <string>
7#include <sstream>
8
9#include "boost/filesystem.hpp"
10
11#include "hdf5.h"
12#include "hdf5_hl.h"
13#include "query_backend.h"
14
15namespace cyclus {
16
17/// A Recorder backend that writes data to an hdf5 file. Identically named
18/// Datum objects have their data placed as rows in a single table.
19///
20/// The HDF5 backend ensures that every column in its tables is represented
21/// in the schema with a fixed size. This in turn ensures that the schema itself
22/// is of a fixed size. This fixed size constraint applies even to variable length
23/// (VL) data types (string, blob, vector, etc).
24///
25/// Variable length data is handled in a special way to ensure a fixed length
26/// column. The naive approach would be to set a maximum size based on the data
27/// available. However, this is not truly a fixed length data type. Instead, the
28/// HDF5 backend serves as an on-disk bidirectional hash map for each VL data type.
29///
30/// A regular hash table applies a hash function to keys and stores the values based
31/// on this hash. Keys are unique and values may be repeated for many keys. In a
32/// bidirectional hash map the keys and values are both one-to-one and onto. This
33/// makes storing a separate hash redundant, since the key and the hash are the same.
34///
35/// The HDF5 backend uses the well-known SHA1 hash function as its keys for VL data.
36/// SHA1 is chosen because its digest is 20 bytes (160 bits), 5x the size of a standard
37/// 32-bit unsigned int. This provides a gigantic address space in which to store
38/// variable length data. HDF5 provides two critical features that make an address
39/// space of this size possible: multidimensional arrays and chunking.
40///
41/// HDF5 easily supports 5D arrays. This allows us to use the SHA1 hash not only as
42/// a key but also, cast to a len-5 array of unsigned ints, as an index into
43/// a 5D array. Furthermore, for such an array we can set the chunk size to a single
44/// element ([1, 1, 1, 1, 1]). This allows us to have the full space available
45/// ([UINT_MAX, UINT_MAX, UINT_MAX, UINT_MAX, UINT_MAX]) while only storing the
46/// data that actually exists!
47///
48/// In the table columns for VL data, the HDF5 backend stores the SHA1 as a length-5
49/// array of unsigned ints. Looking up the associated value is simply a matter of using
50/// this array as an index into a special data type array as above.
51/// This has the added advantage of de-duplicating storage for identical entries.
52///
53/// On disk the keys and values for a data type are stored as arrays named with the
54/// base data type and the strings "Keys" and "Vals" appended respectively. For
55/// instance, BLOB is stored in the arrays BlobKeys and BlobVals while VL_VECTOR_INT
56/// is stored in the arrays VectorIntKeys and VectorIntVals.
57///
58/// In memory, all active keys are stored in the vlkeys_ private member of this class.
59/// This maps the DbType to a set of the SHA1 digests. This is used to avoid
60/// rewriting values that already exist on disk.
61///
62/// The cost of the bidirectional hash map strategy is that the values need to be
63/// looked up in a separate read() from that of the table itself. However, users
64/// of VL data types should expect a performance hit, and this is one of
65/// the more efficient strategies.
66///
67/// Another implicit problem with all hash mappings is the possibility of collision.
68/// However, this is vanishingly unlikely in practice. For SHA1, there is a 3.4e-13 chance
69/// of having a single collision with 1e18 (a billion billion) entries.
70///
71/// Still, if the address space of SHA1 ever becomes insufficient for some reason,
72/// please move to a larger SHA value such as SHA224 or SHA256 or higher. Such a
73/// migration is not anticipated but would be straightforward.
74class Hdf5Back : public FullBackend {
75 public:
76 /// Creates a new backend writing data to the specified file.
77 ///
78 /// @param path the file to write to. If it exists, it will be overwritten.
79 Hdf5Back(std::string path);
80
81 /// Cleans up resources and closes the file.
82 virtual ~Hdf5Back();
83
84 /// Closes and flushes the backend.
85 virtual void Close();
86
87 virtual void Notify(DatumList data);
88
89 virtual std::string Name();
90
91 virtual inline void Flush() { H5Fflush(file_, H5F_SCOPE_GLOBAL); }
92
93 virtual QueryResult Query(std::string table, std::vector<Cond>* conds);
94
95 virtual std::map<std::string, DbTypes> ColumnTypes(std::string table);
96
97 virtual std::list<ColumnInfo> Schema(std::string table);
98
99 virtual std::set<std::string> Tables();
100
101 private:
102 /// Creates a QueryResult from a table description.
103 QueryResult GetTableInfo(std::string title, hid_t dset, hid_t dt);
104
105 /// Reads a table's column types into schemas_ if they aren't already there
106 /// \{
107 void LoadTableTypes(std::string title, hsize_t ncols, Datum *d);
108 void LoadTableTypes(std::string title, hid_t dset, hsize_t ncols);
109 /// \}
110
111 /// Creates a fixed-length HDF5 string type of length n.
112 hid_t CreateFLStrType(int n);
113
114 /// Creates and initializes an hdf5 table with schema defined by d.
115 void CreateTable(Datum* d);
116
117 /// Writes a group of Datum objects with the same title to their
118 /// corresponding hdf5 dataset.
119 void WriteGroup(DatumList& group);
120
121 /// Fill a contiguous memory buffer with data from group for writing to an
122 /// hdf5 dataset.
123 void FillBuf(std::string title, char* buf, DatumList& group, size_t* sizes,
124 size_t rowsize);
125
126 /// Read variable length data from the database.
127 /// @param rawkey the SHA1 digest key as a byte array.
128 /// @return the value indicated by this type at this location.
129 template <typename T, DbTypes U>
130 T VLRead(const char* rawkey);
131
132 /// Writes variable length data to its on-disk bidirectional hash map.
133 /// @param x the data to write.
134 /// @param dbtype the data type of x.
135 /// @return the key of x, which is a SHA1 hash as a len-5 array of ints.
136 /// \{
137 template <typename T, DbTypes U>
138 Digest VLWrite(const T& x);
139
140 template <typename T, DbTypes U>
141 inline Digest VLWrite(const boost::spirit::hold_any* x) {
142 return VLWrite<T, U>(x->cast<T>());
143 }
144 /// \}
145
146 template <DbTypes U>
147 void WriteToBuf(char* buf, std::vector<int>& shape, const boost::spirit::hold_any* a, size_t column);
148
149 /// Gets an HDF5 reference dataset for a variable length datatype
150 /// If the dataset does not exist in the database, it will create it.
151 ///
152 /// @param dbtype the datatype to retrieve
153 /// @param forkeys specifies whether to retrieve the keys (true) or
154 /// values (false) dataset, optional
155 /// @return the dataset identifier
156 hid_t VLDataset(DbTypes dbtype, bool forkeys);
157
158 /// Appends a key to a variable length key dataset
159 ///
160 /// @param dset an open HDF5 dataset
161 /// @param dbtype the variable length data type
162 /// @param key the SHA1 digest to append
163 void AppendVLKey(hid_t dset, DbTypes dbtype, const Digest& key);
164
165
166 /// Inserts variable length data into its value dataset
167 ///
168 /// @param dset an open HDF5 dataset
169 /// @param dbtype the variable length data type
170 /// @param key the SHA1 digest to append
171 /// @param buf the value or buffer to insert
172 /// \{
173 void InsertVLVal(hid_t dset, DbTypes dbtype, const Digest& key,
174 const std::string& val);
175 void InsertVLVal(hid_t dset, DbTypes dbtype, const Digest& key,
176 hvl_t buf);
177 /// \}
178
179 /// Converts a value to a variable length buffer for HDF5.
180 /// \{
181 hvl_t VLValToBuf(const std::vector<int>& x);
182 hvl_t VLValToBuf(const std::vector<float>& x);
183 hvl_t VLValToBuf(const std::vector<double>& x);
184 hvl_t VLValToBuf(const std::vector<std::string>& x);
185 hvl_t VLValToBuf(const std::vector<cyclus::Blob>& x);
186 hvl_t VLValToBuf(const std::vector<boost::uuids::uuid>& x);
187 hvl_t VLValToBuf(const std::set<int>& x);
188 hvl_t VLValToBuf(const std::set<float>& x);
189 hvl_t VLValToBuf(const std::set<double>& x);
190 hvl_t VLValToBuf(const std::set<std::string>& x);
191 hvl_t VLValToBuf(const std::set<cyclus::Blob>& x);
192 hvl_t VLValToBuf(const std::set<boost::uuids::uuid>& x);
193 hvl_t VLValToBuf(const std::list<bool>& x);
194 hvl_t VLValToBuf(const std::list<int>& x);
195 hvl_t VLValToBuf(const std::list<float>& x);
196 hvl_t VLValToBuf(const std::list<double>& x);
197 hvl_t VLValToBuf(const std::list<std::string>& x);
198 hvl_t VLValToBuf(const std::list<cyclus::Blob>& x);
199 hvl_t VLValToBuf(const std::list<boost::uuids::uuid>& x);
200 hvl_t VLValToBuf(const std::map<int, bool>& x);
201 hvl_t VLValToBuf(const std::map<int, int>& x);
202 hvl_t VLValToBuf(const std::map<int, float>& x);
203 hvl_t VLValToBuf(const std::map<int, double>& x);
204 hvl_t VLValToBuf(const std::map<int, std::string>& x);
205 hvl_t VLValToBuf(const std::map<int, cyclus::Blob>& x);
206 hvl_t VLValToBuf(const std::map<int, boost::uuids::uuid>& x);
207 hvl_t VLValToBuf(const std::map<std::string, bool>& x);
208 hvl_t VLValToBuf(const std::map<std::string, int>& x);
209 hvl_t VLValToBuf(const std::map<std::string, float>& x);
210 hvl_t VLValToBuf(const std::map<std::string, double>& x);
211 hvl_t VLValToBuf(const std::map<std::string, std::string>& x);
212 hvl_t VLValToBuf(const std::map<std::string, cyclus::Blob>& x);
213 hvl_t VLValToBuf(const std::map<std::string, boost::uuids::uuid>& x);
214 hvl_t VLValToBuf(const std::map<std::pair<int, std::string>, double>& x);
215 hvl_t VLValToBuf(const std::map<std::string, std::vector<double>>& x);
216 hvl_t VLValToBuf(const std::map<std::string, std::map<int, double>>& x);
217 hvl_t VLValToBuf(const std::map<std::string, std::pair<double, std::map<int, double>>>& x);
218 hvl_t VLValToBuf(const std::map<int, std::map<std::string, double>>& x);
219 hvl_t VLValToBuf(const std::map<std::string, std::vector<std::pair<int, std::pair<std::string, std::string>>>>& x);
220 hvl_t VLValToBuf(const std::list<std::pair<int, int>>& x);
221 hvl_t VLValToBuf(const std::map<std::string, std::pair<std::string, std::vector<double>>>& x);
222 hvl_t VLValToBuf(const std::map<std::string, std::map<std::string, int>>& x);
223 hvl_t VLValToBuf(const std::vector<std::pair<std::pair<double, double>, std::map<std::string, double>>>& x);
224 hvl_t VLValToBuf(const std::vector<std::pair<int, std::pair<std::string, std::string>>>& x);
225 hvl_t VLValToBuf(const std::map<std::pair<std::string, std::string>, int>& x);
226 hvl_t VLValToBuf(const std::map<std::string, std::map<std::string, double>>& x);
227
228
229 /// \}
230
231 /// Converts a variable length buffer to a value for HDF5.
232 /// \{
233 template <typename T>
234 T VLBufToVal(const hvl_t& buf);
235 /// \}
236
237 /// Flag for whether the backend is closed or not.
238 bool closed_ = false;
239
240 /// A class to help with hashing variable length datatypes
241 Sha1 hasher_;
242
243 /// A reference to a database.
244 hid_t file_;
245 /// The HDF5 UUID type, 16 byte char string.
246 hid_t uuid_type_;
247 /// The HDF5 SHA1 type, len-5 int array.
248 hid_t sha1_type_;
249 /// The HDF5 variable length string type.
250 hid_t vlstr_type_;
251 /// The HDF5 Blob type, variable length string.
252 hid_t blob_type_;
253
254 /// Variable length value chunk size and extent
255 static const hsize_t vlchunk_[CYCLUS_SHA1_NINT];
256
257 /// Listing of types opened here so that we may close them.
258 std::set<hid_t> opened_types_;
259
260 /// Stores the database's path+name, declared during construction.
261 std::string path_;
262
263 /// Offsets in bytes of each column in the tables, note that Hdf5Back itself
264 /// owns the value pointers and deallocates them in the destructor.
265 std::map<std::string, size_t*> col_offsets_;
266
267 /// Size in bytes of each column in the tables, note that Hdf5Back itself
268 /// owns the value pointers and deallocates them in the destructor.
269 std::map<std::string, size_t*> col_sizes_;
270
271 /// Total size in bytes of the whole schema in the tables.
272 std::map<std::string, size_t> schema_sizes_;
273
274 /// Backend database specific datatypes for each column in the tables.
275 /// Note that Hdf5Back itself owns the value pointers and deallocates them
276 /// in the destructor.
277 std::map<std::string, DbTypes*> schemas_;
278
279 /// Map of array name (e.g. StringVals, BlobVals) to the HDF5 id for the
280 /// corresponding dataset for variable length data.
281 std::map<std::string, hid_t> vldatasets_;
282
283 /// Map of database type to the corresponding HDF5 datatype.
284 std::map<DbTypes, hid_t> vldts_;
285
286 /// Map of database type to the set of current keys present in the database.
287 std::map<DbTypes, std::set<Digest> > vlkeys_;
288};
289
290const hsize_t Hdf5Back::vlchunk_[CYCLUS_SHA1_NINT] = {1, 1, 1, 1, 1};
291
292} // namespace cyclus
293
294#endif // CYCLUS_SRC_HDF5_BACK_H_
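
The class comment above describes using a 20-byte SHA1 digest both as a key and as a length-5 index, with an in-memory key set that prevents rewriting values already stored. A minimal in-memory sketch of that idea follows. All names here (kSha1NInt, DigestSketch, PackDigest, VLStoreSketch) are hypothetical stand-ins and are not the Cyclus Digest or Sha1 types; the big-endian packing of bytes into words is also an assumption made for illustration, and a std::map stands in for the sparse, chunked 5D HDF5 array.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for CYCLUS_SHA1_NINT: five 32-bit words per digest.
constexpr std::size_t kSha1NInt = 5;

// Hypothetical stand-in for the digest type: usable as a key and as a 5D index.
using DigestSketch = std::array<std::uint32_t, kSha1NInt>;

// Packs a raw 20-byte SHA1 digest into five 32-bit words.
// Big-endian word order is an assumption, not taken from Cyclus.
DigestSketch PackDigest(const unsigned char (&raw)[20]) {
  DigestSketch d{};
  for (std::size_t i = 0; i < kSha1NInt; ++i) {
    d[i] = (static_cast<std::uint32_t>(raw[4 * i]) << 24) |
           (static_cast<std::uint32_t>(raw[4 * i + 1]) << 16) |
           (static_cast<std::uint32_t>(raw[4 * i + 2]) << 8) |
           static_cast<std::uint32_t>(raw[4 * i + 3]);
  }
  return d;
}

// In-memory analogue of one Keys/Vals pair: a set of known keys guards the
// value store, so identical values are stored only once.
class VLStoreSketch {
 public:
  // Returns true if the value was newly stored, false if the key was known.
  bool Write(const DigestSketch& key, const std::string& val) {
    if (!keys_.insert(key).second) return false;  // already present: skip
    vals_[key] = val;
    return true;
  }

  // Looks up the value stored under key; throws std::out_of_range if absent.
  const std::string& Read(const DigestSketch& key) const {
    return vals_.at(key);
  }

  std::size_t size() const { return vals_.size(); }

 private:
  std::set<DigestSketch> keys_;               // plays the role of vlkeys_
  std::map<DigestSketch, std::string> vals_;  // plays the role of the Vals array
};
```

For example, packing the published SHA1 test vector for "abc" (a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d) yields five words whose first element is 0xa9993e36, and writing the same key twice stores the value only once.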
Used to specify and send a collection of key-value pairs to the Recorder for recording.
Definition datum.h:15
The digest type for SHA1s.
Interface implemented by backends that support recording and querying.
A Recorder backend that writes data to an hdf5 file.
Definition hdf5_back.h:74
virtual void Close()
Closes and flushes the backend.
Definition hdf5_back.cc:41
virtual void Flush()
Flushes all buffered data in the backend to its final format/location.
Definition hdf5_back.h:91
virtual std::map< std::string, DbTypes > ColumnTypes(std::string table)
Return a map of column names of the specified table to the associated database type.
virtual void Notify(DatumList data)
Used to pass a list of new/collected Datum objects.
Definition hdf5_back.cc:77
virtual std::string Name()
Used to uniquely identify a backend, particularly if there is more than one in a simulation.
virtual std::list< ColumnInfo > Schema(std::string table)
Return information about all columns of a table.
virtual QueryResult Query(std::string table, std::vector< Cond > *conds)
Return a set of rows from the specified table that match all given conditions.
Definition hdf5_back.cc:163
virtual ~Hdf5Back()
Cleans up resources and closes the file.
Definition hdf5_back.cc:72
Hdf5Back(std::string path)
Creates a new backend writing data to the specified file.
Definition hdf5_back.cc:11
virtual std::set< std::string > Tables()
Return a set of all table names currently in the database.
Meta data and results of a query.
taken directly from OsiSolverInterface.cpp on 2/17/14 from https://projects.coin-or....
Definition agent.cc:14
std::vector< Datum * > DatumList
Definition rec_backend.h:12
DbTypes
This is the primary list of all supported database types.
#define CYCLUS_SHA1_NINT
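
The collision figure quoted in the class comment (a 3.4e-13 chance with 1e18 entries) is consistent with the standard birthday-bound approximation p ≈ n^2 / (2 * 2^bits) for a 160-bit digest. The source does not say how its number was derived, so the sketch below is only a plausibility check under that assumption; CollisionProbability is a hypothetical helper, not part of Cyclus.

```cpp
#include <cmath>

// Birthday-bound approximation for the probability of at least one collision
// among n uniformly random digests of the given bit width:
//   p ~= n^2 / (2 * 2^bits), which is only accurate while p is much less than 1.
double CollisionProbability(double n, int bits) {
  return (n * n) / (2.0 * std::ldexp(1.0, bits));  // ldexp(1.0, bits) == 2^bits
}
```

Calling CollisionProbability(1e18, 160) gives roughly 3.4e-13, and the same call with 256 bits gives a value many orders of magnitude smaller, which is the effect of moving to a larger SHA variant as the comment suggests.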