Updating Existing Data¶

Ingestion된 데이터를 저장하고 사용하다가 데이터를 업데이트하고 싶을 때

Updating Dimension Values¶

Dimension 값을 자주 바꾸길 원한다면 Lookup을 사용할 수 있다.

Rebuilding Segments(Reindexing)¶

특정 기간에 대해서 세그먼트를 다시 저장할 수 있다. 기존 세그먼트에 컬럼을 추가하거나, 롤업 granularity를 바꾸고 싶을 때 사용한다.

이 방법을 사용할 경우에 대비하여 원시 데이터의 사본을 보관해야한다.

Dealing with Delayed Events(Delta Ingestion)¶

Batch Ingestion이 되었지만 그 후에 지연된 이벤트가 들어왔을 때 새로 인덱싱을 하는것 대신, Delta Ingestion을 사용해 기존 세그먼트에 추가할 수 있다.

Reindexing and Delta Ingestion with Hadoop Batch Ingestion¶

inputSpec¶

inputSpect을 사용해서 Ingestion할 때 어디서 데이터를 읽어올지 결정할 수 있음

{
    "type": "dataSource",
    "ingestionSpec": { ... },
	"maxSplitSize": 0
}

Field	Description	Required
type	`dataSource` 문자열로	O
ingestionSpec	로드될 세그먼트에 대한 설정. 아래 설명	O
maxSplitSize	Enables combining multiple segments into single Hadoop InputSplit according to size of segments. With -1, druid calculates max split size based on user specified number of map task(mapred.map.tasks or mapreduce.job.maps). By default, one split is made for one segment.	X

ingestionSpec¶

{
    "dataSource": "",
    "intervals": ["...", ],
    "segments": [{ ... }, ],
    "filter": { ... },
    "dimensions": ["...", ],
    "metrics": ["...", ],
    "ignoreWhenNoSegments": true
}

Field	Description	Required
dataSource	로드할 DataSource의 이름	O
intervals	ISO-8601 Intervals.	O
segments	읽을 `segment`의 리스트. default는 자동으로 정해줌. 지정하고 싶다면 `Coordinator` 에 `POST /druid/coordinator/v1/metadata/datasources/segments?full` 로 페이로드에 Interval(`["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"]`) 을 담아서 보내면 된다. 멱등성을 보장하기 위한 필드라 함. `segments`를 넣어주는 것을 `STRONGLY RECOMMENDED!`	X
filter	Filter 설정	X
dimensions	로드될 `dimension`의 리스트. default는 `parseSpec`에서 가져옴. 만약 `parseSpec`이 없으면 모든 dimensions을 가져옴.	X
metrics	로드될 `metric`의 리스트. default는 `aggregator` 의 이름을 가져옴.	X
ignoreWhenNoSegments	세그먼트가 없으면, ingestionSpec을 무시할 지 여부. default는 세그먼트가 발견되지 않으면 오류 발생.	X

Multi inputSpec¶

여러 inputSpec을 사용하는 경우 type 필드를 multi로 설정. Delta Ingestion에서 사용할 수 있음.

하지만 children의 dataSource 인 inputSpec은 하나만 가질 수 있음

"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "multi",
    "children": [
      {
		"type": "dataSource",
		"intervals": [""]
      },
      {
        "type" : "static",
        "paths": "/path/to/more/wikipedia/data/"
      }
    ]  
  },
  ...
}