Why Do We Need Data Management?
Digital research, especially in biology, creates large amounts of complex data
- about 3 billion base pairs in the human genome (23 chromosome pairs)
- 20,000+ human genes
- 60,000+ human protein variants
- measuring the expression patterns of all of these requires many files for each stage of the central dogma of biology (DNA, RNA, protein)
- Data complexity
- huge numbers of files, especially when counting the metadata that relates one data set to another
- large storage capacity needed, either for individual files or collectively
- This data varies in format and type
- raw text
- delimited text
- binary (not human readable)
- often extreme differences in storage requirements or limitations
- Without a management plan:
- protecting, sharing, and even locating data can be a challenge
- With a plan:
- researchers can focus on their areas of expertise
- management policies can be automated
- reproducible studies, open-access resources, and cross-collection data sets can be created, curated, and maintained
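As a rough illustration of the format categories listed above (raw text, delimited text, binary), the sketch below guesses which category a file falls into by inspecting its first few kilobytes. The function name `guess_format`, the sample size, and the candidate delimiters are assumptions made for this example; the heuristic is deliberately simple and will misclassify some files.

```python
import codecs

def guess_format(path, sample_size=4096):
    """Roughly classify a file as 'binary', 'delimited text', or 'raw text'.

    Hypothetical helper for illustration only, not a robust detector.
    """
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)

    # A NUL byte almost never appears in ordinary text files.
    if b"\x00" in sample:
        return "binary"

    # An incremental decoder tolerates a multi-byte character that was
    # cut off at the end of the sample, but still rejects invalid bytes.
    try:
        decoder = codecs.getincrementaldecoder("utf-8")()
        text = decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return "binary"

    lines = [line for line in text.splitlines() if line.strip()]
    # If the sample filled the buffer, the last line may be truncated.
    if len(sample) == sample_size and len(lines) > 1:
        lines = lines[:-1]

    # Call the file delimited if every line holds the same non-zero
    # number of one candidate delimiter.
    for delimiter in ("\t", ",", ";", "|"):
        counts = {line.count(delimiter) for line in lines}
        if len(lines) >= 2 and len(counts) == 1 and min(counts) > 0:
            return "delimited text"
    return "raw text"
```

A call such as `guess_format("file0.dat")` returns one of the three labels. A real management plan would normally record the format explicitly rather than rely on guessing.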
Example:
- Given 100 similarly named files in a directory
file1.dat - file100.dat
- DISCUSS: what can we guess about this data?
- very little
- Over these modules, we’ll discuss data organization best practices for improved computational efficiency, performance, and security.
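To make the similarly named `.dat` files example concrete, the sketch below collects what the file system alone can tell us about such a directory: name, size, modification time, and a checksum. The function name `describe_files` and the record fields are assumptions chosen for this illustration. Notice that nothing in the result says what the data mean, which is the gap a management plan is meant to fill.

```python
import hashlib
import os
from datetime import datetime, timezone

def describe_files(directory, suffix=".dat"):
    """Return one record per matching file, using only file-system facts.

    Hypothetical helper for illustration; field names are assumptions.
    """
    with os.scandir(directory) as entries:
        # Sorted by name as text, so "file10.dat" sorts before "file2.dat".
        ordered = sorted(entries, key=lambda entry: entry.name)

    records = []
    for entry in ordered:
        if not entry.is_file() or not entry.name.endswith(suffix):
            continue
        info = entry.stat()

        # Hash in chunks so large files are not loaded into memory at once.
        digest = hashlib.sha256()
        with open(entry.path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)

        modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
        records.append({
            "name": entry.name,
            "bytes": info.st_size,
            "modified": modified.isoformat(),
            "sha256": digest.hexdigest(),
        })
    return records
```

Printing the records from `describe_files(".")` gives a basic inventory. Descriptive file names and recorded metadata would have to supply everything else, such as the sample, instrument, or processing step behind each file.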