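In the narrow sense, metadata is the external information produced when data is organized, not the content of the data itself; for example, the schema of a table in a relational database (fields, field types, comments). Other examples include:
- Disk Usage (how much storage the data occupies)
- Lineage (how datasets derive from one another)
- Script and User (Owner) (the production scripts and their maintainers)
- Security (Permission) (who is allowed to view and use the data)
- Layers (the layer a dataset belongs to)
- ... ...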
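If you do not know at the outset which information to collect, the safest approach is to design the platform on a "the more, the better" basis.
"Improving data quality" has to be more than a slogan: it requires a centrally managed tool. As business lines expand and more and more tools are added, the data warehouse team cannot be responsible only for producing data; it should also act as the manager of data quality. A platform for querying metadata is therefore needed, both for external and for internal users.
This article uses WhereHows, open-sourced by LinkedIn, as its example; if the opportunity arises, later articles will cover Apache Atlas and Cloudera Navigator Data Management.
In a nutshell, WhereHows integrates the metadata produced by different data sources and uses the resulting integrated metadata to support a range of analysis scenarios. Its organizing idea is in the name: Where + How. Let's look at its architecture diagram below.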
At a high level, the WhereHows system can be divided into three parts:
- data model
- backend ETL
- Web Service
Periodic jobs fetch metadata from different systems; the metadata is then transformed and stored in the WhereHows repository. These jobs can be scheduled through the Akka scheduler, or they can be run independently or scheduled through any other scheduler. The storage is a MySQL database. A UI and APIs are also provided for interacting with users.
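The fetch, transform, and store cycle described above can be sketched roughly as follows. This is a minimal illustration, not WhereHows code: the function names, the `dataset` table layout, and the sample dataset names are all hypothetical, and an in-memory SQLite database stands in for the MySQL repository so the sketch runs without a server.

```python
import sqlite3


def fetch_metadata():
    """Hypothetical extractor standing in for a job that reads a source system."""
    return [
        {"name": "/data/tracking/PageViewEvent", "source": "hdfs",
         "fields": ["memberId", "pageKey"]},
        {"name": "DWH.DIM_MEMBER", "source": "teradata",
         "fields": ["member_sk", "country"]},
    ]


def transform(record):
    """Flatten one raw record into a row for the repository table."""
    return (record["name"], record["source"], ",".join(record["fields"]))


def load(conn, rows):
    """Store rows in the repository (SQLite here; MySQL in WhereHows)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dataset ("
        "name TEXT PRIMARY KEY, source TEXT, fields TEXT)"
    )
    # Replace on conflict so re-running the periodic job does not duplicate rows.
    conn.executemany(
        "INSERT OR REPLACE INTO dataset (name, source, fields) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def run_once(conn):
    """One execution of the periodic job; returns the number of rows written."""
    rows = [transform(r) for r in fetch_metadata()]
    load(conn, rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
first_run = run_once(conn)
second_run = run_once(conn)  # simulates the next scheduled execution
row_count = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]
```

Because the load step replaces rows by key, any scheduler can trigger `run_once` repeatedly without the repository growing on every run.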
Four data components:
- Datasets: Dataset schema, comments, sample data ...
- Operational data: flow group, flow, job, job dependency, execution, id map tables (helper tables to generate unique IDs)
- Lineage data: input and output of jobs, dependency, partition, high watermark
- WhereHows ETL and Web UI/Service: configurations
The dataset and operational data are like two endpoints, and the lineage data is the bridge that connects them. In this way, a user can either start from a dataset to explore which jobs produce or consume it, or start from a job to check which datasets it reads from or writes to.
One key point in the design of the data model is to keep it as general as possible, so that dataset, operational, and lineage data from different systems fit the same schema and can be queried with the same strategy.
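The "two endpoints and a bridge" idea can be illustrated with a small in-memory index. This is only a sketch under assumptions: the class and method names, the job names, and the dataset paths are hypothetical and do not reflect the actual WhereHows schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    """One lineage record linking a job to a dataset."""
    job: str
    dataset: str
    operation: str  # "read" or "write"


class LineageIndex:
    """Indexes lineage edges from both endpoints: datasets and jobs."""

    def __init__(self, edges):
        self._by_dataset = defaultdict(list)
        self._by_job = defaultdict(list)
        for edge in edges:
            self._by_dataset[edge.dataset].append(edge)
            self._by_job[edge.job].append(edge)

    def producers(self, dataset):
        """Jobs that write the given dataset."""
        return sorted(e.job for e in self._by_dataset.get(dataset, [])
                      if e.operation == "write")

    def consumers(self, dataset):
        """Jobs that read the given dataset."""
        return sorted(e.job for e in self._by_dataset.get(dataset, [])
                      if e.operation == "read")

    def inputs(self, job):
        """Datasets the given job reads from."""
        return sorted(e.dataset for e in self._by_job.get(job, [])
                      if e.operation == "read")

    def outputs(self, job):
        """Datasets the given job writes to."""
        return sorted(e.dataset for e in self._by_job.get(job, [])
                      if e.operation == "write")


index = LineageIndex([
    LineageEdge("load_events", "/raw/events", "read"),
    LineageEdge("load_events", "/clean/events", "write"),
    LineageEdge("daily_report", "/clean/events", "read"),
    LineageEdge("daily_report", "/report/daily", "write"),
])
```

Starting from the dataset `/clean/events`, `index.producers(...)` and `index.consumers(...)` lead to jobs; starting from the job `daily_report`, `index.inputs(...)` and `index.outputs(...)` lead back to datasets.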
Technology stack: Java + Jython + MySQL
Java and Jython are the main languages.
Dataset metadata is collected and stored in the WhereHows repository.
Dataset metadata includes schema, path, partition, sample data, high watermark, and so on.
(Currently, both Hadoop and Teradata dataset metadata are covered.)
HDFS: a dedicated program scans all the folders in a whitelist and collects information at the dataset level. The program can be run remotely on a Hadoop gateway, and the results are copied back into the WhereHows database. Note that this needs an account with broad read permission so that it can read all datasets. (In my case, the account is "hdfs".)
Teradata: the metadata is queried from the DBC tables.
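The HDFS whitelist scan described above can be sketched as follows. This is not the WhereHows scanner: the local filesystem stands in for HDFS so the example runs anywhere, and the function name, the result fields, and the rule "a folder that directly contains files is a dataset" are assumptions made for illustration.

```python
import os
import tempfile


def scan_whitelist(root, whitelist):
    """Collect dataset-level info for folders under the whitelisted paths."""
    datasets = []
    for prefix in whitelist:
        top = os.path.join(root, prefix)
        if not os.path.isdir(top):
            continue  # Skip whitelist entries that do not exist.
        for dirpath, _dirnames, filenames in os.walk(top):
            if not filenames:
                continue  # Assumption: only folders holding files are datasets.
            size = sum(os.path.getsize(os.path.join(dirpath, name))
                       for name in filenames)
            datasets.append({
                "path": os.path.relpath(dirpath, root),
                "file_count": len(filenames),
                "bytes": size,
            })
    return sorted(datasets, key=lambda d: d["path"])


# Build a small throwaway tree: one whitelisted folder and one that is not.
with tempfile.TemporaryDirectory() as root:
    sample_files = [
        ("data/events/part-0", b"hello"),
        ("data/events/part-1", b"abc"),
        ("tmp/scratch/x", b"z"),
    ]
    for rel, payload in sample_files:
        full = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as handle:
            handle.write(payload)
    found = scan_whitelist(root, ["data"])
```

Only the folder under the whitelisted `data` prefix is reported; `tmp/scratch` is never visited, which is the point of driving the scan from a whitelist.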
Operation data ETL
Operational data include: flow group definition, flow definition, job definition, ownership info, schedule info, flow execution info, job execution info.
Operation data ETL is a periodic process that fetches this information from the scheduler databases into the WhereHows database. It uses a standard data model, which ensures that there is a unified way to find operational information.
* Pain point 1: Different scheduler systems have different data models, so a transformation is needed. For example, Azkaban executes at the flow level and has a unique key for each flow execution, but it does not have a unique key for each job execution. Oozie, on the other hand, has a UUID for each job execution.
* Pain point 2: Many source metadata systems do not track versions or revisions, or track them at a less intuitive level. WhereHows uses mapping logic to track changes at a more human-friendly level. One example is the WhereHows-derived flow version: certain schedulers, such as Oozie, do not keep a version of a flow, and even Azkaban lacks a flow-level version.
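The transformation in the first pain point can be sketched like this. It is an illustration only: the `JobExecution` record, the function and parameter names, and the key formats are hypothetical, and deriving the Azkaban key from the flow execution id plus the job name is one possible mapping, not necessarily the one WhereHows uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobExecution:
    """Unified job-execution record shared by all schedulers (hypothetical)."""
    scheduler: str
    flow_exec_id: str
    job_name: str
    job_exec_key: str  # Unique per job execution in the unified model.


def from_azkaban(flow_exec_id, job_name):
    # Azkaban has a unique key per flow execution but none per job execution,
    # so one is derived here from the flow execution id and the job name.
    # Assumption: a job runs at most once within a flow execution.
    key = f"azkaban:{flow_exec_id}:{job_name}"
    return JobExecution("azkaban", str(flow_exec_id), job_name, key)


def from_oozie(flow_exec_id, job_name, job_exec_uuid):
    # Oozie already provides a UUID per job execution, so it is reused as is.
    key = f"oozie:{job_exec_uuid}"
    return JobExecution("oozie", str(flow_exec_id), job_name, key)


executions = [
    from_azkaban(1024, "load_events"),
    from_azkaban(1024, "daily_report"),
    from_oozie("wf-0001", "load_events", "0f8fad5b-d9cb-469f-a165-70867728950e"),
]
keys = [e.job_exec_key for e in executions]
```

Once every scheduler's executions carry a `job_exec_key`, downstream queries can treat them uniformly regardless of where they came from.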
Lineage information refers to the dependencies between jobs and datasets. It also includes some operational data, such as whether an operation is a read or a write, and how many records it reads or writes.
Quick Start with VM