[SQL Study_Team 1] Week 6 Lecture Notes

by j.hyeon 2023. 5. 17. 22:49

[See the Notion page below for the version of these notes with images]

https://www.notion.so/ad93357c471d4a1f827533099904e9e6


Character data types and common issues

  • Character types by length
    • character(n) or char(n) _ fixed length n
    • character varying(n) or varchar(n) _ maximum length n
    • text or varchar (no length given) _ unlimited length
  • Types of text data
    • categorical: short strings whose values repeat across rows
    • unstructured text: longer strings whose values are mostly unique
      • analysis _ extract features from the text, or create variables that indicate whether a particular characteristic is present
  • Grouping and counting categorical variables
SELECT category, count(*)
FROM product
GROUP BY category;
  • Order
    • Alphabetical order _ ' ' < 'A' < 'B' < 'a' < 'b'
SELECT category, count(*)
FROM product
GROUP BY category
ORDER BY count DESC;  -- most frequent first
-- or: ORDER BY category;  (in order of the categorical variable)
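
The grouping query above treats every distinct string as its own category, so differences in case or stray spaces split one category into several. The following is a minimal sketch, assuming PostgreSQL; the rows supplied through VALUES are hypothetical stand-ins for the product table.

```sql
-- Hypothetical sample rows; only the table and column names come from the notes.
WITH product(category) AS (
    VALUES ('fruit'), ('Fruit'), ('fruit '), ('dairy'), ('fruit')
)
SELECT category,
       length(category) AS chars,  -- makes a trailing space visible
       count(*)         AS n
FROM product
GROUP BY category
ORDER BY n DESC;
```

Here 'fruit', 'Fruit', and 'fruit ' come back as three separate groups, which is the kind of issue the case and space functions in the next section address.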

Cases and Spaces

  • convert case
SELECT lower('aBc DeFg 7-');
-- abc defg 7-
  • case insensitive comparisons
SELECT *
FROM fruit
WHERE lower(fav_fruit) = 'apple';
  • case insensitive searches
SELECT *
FROM fruit
WHERE fav_fruit ILIKE '%apple%';

-- LIKE, by contrast, is case sensitive and matches only the characters exactly as written
  • trimming spaces
    • trim(' abc ') (== btrim(' abc ')) = 'abc'
    • rtrim(' abc ') = ' abc'
    • ltrim(' abc ') = 'abc '
  • trimming other values
SELECT trim('Wow!', '!');
-- Wow
SELECT trim('WoW!', '!wW');
-- o
  • combining functions
SELECT trim(lower('Wow!'), '!w');
-- o
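
To see the case and space functions working together on a column, here is a minimal sketch, assuming PostgreSQL. The fruit rows are hypothetical sample values, not data from the study.

```sql
-- Hypothetical sample rows for the fruit table used in the notes.
WITH fruit(fav_fruit) AS (
    VALUES (' Apple'), ('apple '), ('APPLE'), ('banana')
)
SELECT lower(trim(fav_fruit)) AS cleaned,  -- strip spaces, then lowercase
       count(*)               AS n
FROM fruit
GROUP BY cleaned
ORDER BY n DESC;
-- cleaned | n
-- apple   | 3
-- banana  | 1
```

Without the cleanup, the three spellings of apple would each be counted as a separate value.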

Splitting and concatenating text

  • substring
SELECT left('abcde', 2),
       right('abcde', 2);
-- ab | de
SELECT left('abc', 10),
       length(left('abc', 10));
-- abc | 3
  • SELECT substring(string FROM start FOR length);
    • SELECT substr(string, start, length); _ same result
  • delimiters
    • splitting _ SELECT split_part(string, delimiter, part);
    SELECT split_part('a,bc,d', ',', 2);
    -- bc

  • concatenating text
    SELECT concat('a', NULL, 'cc');
    -- acc (concat() skips NULL arguments)
    SELECT 'a' || NULL || 'cc';
    -- NULL (|| returns NULL if any operand is NULL)

    SELECT concat('a', 2, 'cc');
    SELECT 'a' || 2 || 'cc';
    -- a2cc (both)
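
Because || returns NULL as soon as one operand is NULL, a nullable column can silently wipe out a concatenated value. One common workaround is coalesce(). This is a sketch assuming PostgreSQL; the person table and its columns are hypothetical and do not appear in the lecture.

```sql
-- Hypothetical table: middle_name is NULL for some rows.
WITH person(first_name, middle_name, last_name) AS (
    VALUES ('Ada',  NULL, 'Lovelace'),
           ('John', 'Q',  'Public')
)
SELECT first_name || ' '
       || coalesce(middle_name || ' ', '')  -- NULL || ' ' is NULL, so fall back to ''
       || last_name AS full_name
FROM person
ORDER BY last_name;
-- Ada Lovelace
-- John Q Public
```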

Strategies for multiple transformations

  • CASE WHEN
SELECT CASE WHEN category LIKE '%: %' THEN split_part(category, ': ', 1)
            WHEN category LIKE '% - %' THEN split_part(category, ' - ', 1)
            ELSE split_part(category, ' | ', 1)
       END AS major_category,
       sum(business)
FROM naics
GROUP BY major_category;
  • recoding table
    1. create temp table with original values
    CREATE TEMP TABLE recode AS
    SELECT DISTINCT fav_fruit AS original,
                    fav_fruit AS standardized
    FROM fruit;

    2. update to create standardized values
    UPDATE recode
    SET standardized=trim(lower(original));

    UPDATE recode
    SET standardized='banana'
    WHERE standardized LIKE '%nn%';

    -- rtrim strips only a trailing 's'; trim would also strip a leading one
    UPDATE recode
    SET standardized=rtrim(standardized, 's');

    3. join original data to standardized data
    -- original only
    SELECT fav_fruit, count(*)
    FROM fruit
    GROUP BY fav_fruit;

    -- with recoded values
    SELECT standardized, count(*)
    FROM fruit
    LEFT JOIN recode
    ON fav_fruit=original
    GROUP BY standardized;
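
The recoding steps can also be tried end to end without creating a temp table. The following is a sketch assuming PostgreSQL: it folds the three UPDATE steps into a single CASE expression inside a CTE, which is a swapped-in variation rather than the method from the lecture, and the fruit rows are hypothetical.

```sql
-- Hypothetical sample rows for the fruit table.
WITH fruit(fav_fruit) AS (
    VALUES (' Apple'), ('apple'), ('bannana'), ('apples')
),
recode AS (
    SELECT DISTINCT
           fav_fruit AS original,
           CASE WHEN trim(lower(fav_fruit)) LIKE '%nn%' THEN 'banana'
                ELSE rtrim(trim(lower(fav_fruit)), 's')  -- drop a trailing plural 's'
           END AS standardized
    FROM fruit
)
SELECT standardized, count(*) AS n
FROM fruit
LEFT JOIN recode
ON fav_fruit = original
GROUP BY standardized
ORDER BY standardized;
-- apple  | 3
-- banana | 1
```

A temp table is still the better fit when the rules are worked out step by step, because each UPDATE can be inspected before the next one is applied.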

Working with dates and timestamps

Date/time types and formats

  • date _ YYYY-MM-DD
  • timestamp _ YYYY-MM-DD HH:MM:SS
  • standards
    • YYYY-MM-DD HH:MM:SS (units ordered from largest to smallest)
    • timestamp with time zone _ YYYY-MM-DD HH:MM:SS+HH
  • comparisons _ >, <, =
    • current _ now()
  • subtraction, addition
SELECT now() - '2018-01-01';
-- an interval: the time elapsed since 2018-01-01

SELECT '2010-01-01'::date + 1;
-- 2010-01-02

SELECT '2018-12-10'::date + '1 year'::interval;
-- 2019-12-10 00:00:00

SELECT '2018-12-10'::date + '1 year 2 days 3 minutes'::interval;
-- 2019-12-12 00:03:00
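
The result type of date arithmetic depends on the operand types, which is easy to miss in the examples above. This is a minimal sketch, assuming PostgreSQL; the column aliases are only illustrative.

```sql
SELECT '2018-12-10'::date - '2018-01-01'::date           AS days_between,  -- 343 (an integer)
       '2018-12-10'::timestamp - '2018-01-01'::timestamp AS elapsed,       -- 343 days (an interval)
       '2018-01-31'::date + '1 month'::interval          AS month_later;   -- 2018-02-28 00:00:00
```

Subtracting two dates gives a whole number of days, subtracting two timestamps gives an interval, and adding an interval to a date gives a timestamp. Adding a month to January 31 is clamped to the last day of February.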

Date/time components and aggregation

  • common date/time fields
    • century
    • decade
    • year, month, day
    • hour, minute, second
    • week
    • dow: day of week
  • extracting fields
    • date_part('field', timestamp)
    • EXTRACT(FIELD FROM timestamp)
SELECT date_part('month', now()),
       EXTRACT(MONTH FROM now());
  • extract to summarize by field
-- individual sales
SELECT *
FROM sales
WHERE date >= '2010-01-01'
      AND date < '2017-01-01';
-- by month
SELECT date_part('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
  • truncating dates _ date_trunc('field', timestamp)
SELECT date_trunc('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
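
The two monthly summaries above answer different questions: date_part('month', ...) keeps only the month number, so the same month from different years is merged, while date_trunc('month', ...) keeps the year. This is a sketch assuming PostgreSQL, with hypothetical sales rows.

```sql
-- Hypothetical sample rows for the sales table.
WITH sales(date, amt) AS (
    VALUES ('2016-03-15 10:30:00'::timestamp, 10),
           ('2016-03-20 09:00:00'::timestamp, 5),
           ('2015-03-02 08:00:00'::timestamp, 7)
)
SELECT date_part('month', date)  AS month_number,
       date_trunc('month', date) AS month_start,
       sum(amt)                  AS total
FROM sales
GROUP BY month_number, month_start
ORDER BY month_start;
-- 3 | 2015-03-01 00:00:00 | 7
-- 3 | 2016-03-01 00:00:00 | 15
```

Grouping by month_number alone would return a single row for month 3 with a total of 22.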

Aggregating with date/time series

  • generate series _ SELECT generate_series(from, to, interval)
    SELECT generate_series('2018-01-31', '2018-12-31', '1 month'::interval);
    -- stepping from a month end drifts: 2018-01-31, 2018-02-28, 2018-03-28, ...

    • from the beginning _ generate the first day of each following month, then subtract one day to get the month ends
    SELECT generate_series('2018-02-01', '2019-01-01', '1 month'::interval) - '1 day'::interval;

    • aggregation with series
    WITH hour_series AS (
        SELECT generate_series('2018-04-23 09:00:00',
                               '2018-04-23 14:00:00',
                               '1 hour'::interval) AS hours)

    SELECT hours, count(date)
    FROM hour_series
    LEFT JOIN sales
    ON hours=date_trunc('hour', date)
    GROUP BY hours
    ORDER BY hours;

    • aggregation with bins
    WITH bins AS (
        SELECT generate_series('2018-04-23 09:00:00',
                               '2018-04-23 15:00:00',
                               '3 hour'::interval) AS lower,
               generate_series('2018-04-23 12:00:00',
                               '2018-04-23 18:00:00',
                               '3 hour'::interval) AS upper)

    SELECT lower, upper, count(date)
    FROM bins
    LEFT JOIN sales
    ON date >= lower
       AND date < upper
    GROUP BY lower, upper
    ORDER BY lower;
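
The point of joining a generated series to the data is that periods with no rows still appear, with a count of 0. This is a minimal sketch, assuming PostgreSQL; the three sales timestamps are hypothetical.

```sql
-- Hypothetical sample rows: no sale falls in the 10:00 hour.
WITH sales(date) AS (
    VALUES ('2018-04-23 09:15:00'::timestamp),
           ('2018-04-23 09:45:00'::timestamp),
           ('2018-04-23 11:05:00'::timestamp)
),
hour_series AS (
    SELECT generate_series('2018-04-23 09:00:00'::timestamp,
                           '2018-04-23 11:00:00'::timestamp,
                           '1 hour'::interval) AS hours
)
SELECT hours, count(date) AS n  -- count(date) ignores the NULLs from unmatched hours
FROM hour_series
LEFT JOIN sales
ON hours = date_trunc('hour', date)
GROUP BY hours
ORDER BY hours;
-- 2018-04-23 09:00:00 | 2
-- 2018-04-23 10:00:00 | 0
-- 2018-04-23 11:00:00 | 1
```

Using count(*) instead would report 1 for the empty hour, because the LEFT JOIN still produces one row for it.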

Time between events

  • lead and lag
SELECT date,
       lag(date) OVER (ORDER BY date),
       lead(date) OVER (ORDER BY date)
FROM sales;
  • time between events
SELECT date,
       date - lag(date) OVER (ORDER BY date) AS gap
FROM sales;

-- average time
SELECT avg(gap)
FROM (SELECT date - lag(date) OVER (ORDER BY date) AS gap
      FROM sales) AS gaps;
  • change in a time series
SELECT date, amount,
       lag(amount) OVER (ORDER BY date),
       amount - lag(amount) OVER (ORDER BY date) AS change
FROM sales;
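
To make the lag() calculations concrete, here is a minimal sketch, assuming PostgreSQL, that computes both the gap and the change on three hypothetical sales rows.

```sql
-- Hypothetical sample rows for the sales table.
WITH sales(date, amount) AS (
    VALUES ('2018-04-01'::date, 10),
           ('2018-04-03'::date, 25),
           ('2018-04-10'::date, 15)
)
SELECT date,
       date - lag(date) OVER (ORDER BY date)     AS gap,     -- days since the previous sale
       amount - lag(amount) OVER (ORDER BY date) AS change   -- difference from the previous amount
FROM sales
ORDER BY date;
-- 2018-04-01 | NULL | NULL
-- 2018-04-03 |    2 |   15
-- 2018-04-10 |    7 |  -10
```

The first row has no previous row, so lag() returns NULL there, and avg() over the gaps skips it automatically.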

Character data types and common issues

  • 길이에 따른 character type
    • character(n) or char(n) _ 길이 n으로 고정
    • character varying(n) or varchar(n) _ 최대 길이 n
    • text or varchar _ 길이 무제한
  • text data 유형
    • categorical: 값을 가지는 짧은 길이의 string이 반복
    • unstructed text: unique 값의 더 긴 string
      • 분석 _ text에서 기능 추출하거나, 특정 특성 존재 여부를 나타내는 변수 생성
  • categorical variables에 대한 grouping & counting
  • SELECT category, count(*) FROM product GROUP BY category;
  • Order
    • Alphabetical Order _ ‘ ‘ < ‘A’ < ‘B’ < ‘a’ < ‘b’
  • SELECT category, count(*) FROM product GROUP BY category ORDER BY count DESC; # 많은 순서대로 ORDER BY category; # categorical variable 순서대로

Cases and Spaces

  • convert case
SELECT lower('aBc DeFg 7-');
  • case insensitive comparisons
SELECT *
FROM fruit
WHERE lower(fav_fruit) = 'apple';
  • case insensitive searches
SELECT *
FROM fruit
WHERE fav_fruit ILIKE '%apple%';

# LIKE 사용할 경우 문자 그대로 값에만 적용
  • trimming spaces
    • trim(’ abc ‘) (== btrim(’ abc ‘)) = ‘abc’
    • rtrim(’ abc ‘) = ‘ abc’
    • lrim(’ abc ‘) = ‘abc ‘
  • trimming other values
  • SELECT trim('Wow!', '!'); # Wow SELECT trim('WoW!', '!wW'); #o
  • combining functions
  • SELECT trim(lower('Wow!'), '!w'); # o

Splitting and concatenating text

  • substring
SELECT left('abcde', 2),
				right('abcde', 2),
				
SELECT left('abc', 10),
				length(left('abc', 10));
  • SELECT substring(string FROM start FOR length);
    • SELECT substr(string, start, length); 결과 동일
  • delimiters
    • splitting _ SELECT split_part(string, delimiter, part);
    SELECT split_part('a,bc,d', ',', 2);
    # bc
    
  • concatenating text
    SELECT concat('a', NULL, 'cc');
    # acc
    SELECT 'a' || NULL || 'cc';
    # 
    
  • SELECT concat('a', 2, 'cc'); SELECT 'a' || 2 || 'cc'; # a2cc

Strategies for multiple transformations

  • CASE WHEN
SELECT CASE WHEN category LIKE '%: %' THEN split_part(category, ': ', 1)
						WHEN category LIKE '% - %' THEN split_part(category, '- ', 1)
						ELSE split_part(category, ' | ', 1)
				END AS major_category, sum(business)
FROM naics
GROUP BY major_category;
  • recoding table
    1. create temp table with original values
    CREATE TEMP TABLE recode AS
    SELECT DISTINCT fav_fruit AS original, 
    								fav_fruit AS standardized
    FROM fruit;
    
    1. update to create standardized values
    UPDATE recode
    SET standardized=trim(lower(original));
    
    UPDATE recode
    SET standardized='banana'
    WHERE standardized LIKE '%nn%';
    
    UPDATE recode
    SET standardized=trim(standardized, 's'));
    
    1. join original data to standardized data
    # original only
    SELECT fav_fruit, count(*)
    FROM fruit
    GROUP BY fav_fruit;
    
  • # with recoded values SELECT standardized, count(*) FROM fruit LEFT JOIN recode ON fav_fruit=original GROUP BY standardized;

Working with dates and timestamps

Date/time types and formats

  • date _ YYYY-MM-DD
  • timestamp _ YYYY-MM-DD HH:MM:SS
  • standards
    • timestamp with timezone
    • YYYY-MM-DD HH:MM:SS+HH
  • YYYY-MM-DD HH:MM:SS
  • comparisons _ >, <, =
    • current _ now()
  • subtraction, addition
SELECT now() - '2018-01-01'

SELECT '2010-01-01'::date + 1;
# 2010-01-02

SELECT '2018-12-10'::date + '1 year'::interval;
# 2019-12-10 00:00:00

SELECT '2018-12-10'::date + '1 year 2 days 3 minutes'::interval;
# 2019-12-12 00:00:03

Date/time components and aggregation

  • common data/time fields
    • century
    • decade
    • year, month, day
    • hour, minute, second
    • week
    • dow: day of week
  • extracting fields
date_part('field', timestamp)
EXTRACT(FIELD FROM timestamp)
SELECT date_part('month', now()),
			EXTRACT(MONTH FROM now());
  • extract to summarize by field
# individual sales
SELECT *
FROM sales
WHERE date >= '2010-01-01'
			AND date < '2017-01-01';
# by month
SELECT date_part('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
  • truncating dates _ date_trunc(’field’, timestamp)
SELECT date_trunc('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;

Aggregating with date/time series

  • generate series _ SELECT generate_series(from, to, interval)
    • from the beginning
    SELECT generate_series('2018-01-31', '2018-12-31', '1 month'::interval)
    
    • aggregation with series
    WITH hour_series AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 14:00:00',
    														'1 hour'::interval) AS hours)
    
    SELECT hours, count(data)
    FROM hour_series
    LEFT JOIN sales
    ON hours=date_Truc('hour', date)
    GROUP BY hours
    GROUP BY hours
    
    • aggregation with bins
    WITH bins AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 15:00:00',
    														'3 hour'::interval) AS lower,
    						generate_series('2018-04-23 12:00:00',
    														'2018-04-23 18:00:00',
    														'3 hour'::interval) AS upper)
    
    SELECT lower, upper, count(date)
    FROM bins
    LEFT JOIN sales
    ON date >= lower
    	AND date < upper
    GROUP BY lower, upper
    ORDER BY lower
    
  • SELECT generate_series('2018-02-01', '2019-01-01', '1 month'::interval) - '1 day'::interval;

Time between events

  • lead and log
SELECT date,
				lag(date) OVER (ORDER BY date),
				lead(date) OVER (ORDER BY date)
FROM sales;
  • time between events
SELECT date,
			date - lag(data) OVER (ORDER BY date) AS gap
FROM sales 

# average time
SELECT avg(gap)
FROM (SELECT date - lag(date) OVER (ORDER BY date) AS gap
	FROM sales) AS gaps;
  • change in a time series
SELECT date, amount, lag(amount) OVER (ORDER BY date),
							amount - lag(amount) OVER (ORDER BY date) AS chane

Character data types and common issues

  • 길이에 따른 character type
    • character(n) or char(n) _ 길이 n으로 고정
    • character varying(n) or varchar(n) _ 최대 길이 n
    • text or varchar _ 길이 무제한
  • text data 유형
    • categorical: 값을 가지는 짧은 길이의 string이 반복
    • unstructed text: unique 값의 더 긴 string
      • 분석 _ text에서 기능 추출하거나, 특정 특성 존재 여부를 나타내는 변수 생성
  • categorical variables에 대한 grouping & counting
  • SELECT category, count(*) FROM product GROUP BY category;
  • Order
    • Alphabetical Order _ ‘ ‘ < ‘A’ < ‘B’ < ‘a’ < ‘b’
  • SELECT category, count(*) FROM product GROUP BY category ORDER BY count DESC; # 많은 순서대로 ORDER BY category; # categorical variable 순서대로

Cases and Spaces

  • convert case
SELECT lower('aBc DeFg 7-');
  • case insensitive comparisons
SELECT *
FROM fruit
WHERE lower(fav_fruit) = 'apple';
  • case insensitive searches
SELECT *
FROM fruit
WHERE fav_fruit ILIKE '%apple%';

# LIKE 사용할 경우 문자 그대로 값에만 적용
  • trimming spaces
    • trim(’ abc ‘) (== btrim(’ abc ‘)) = ‘abc’
    • rtrim(’ abc ‘) = ‘ abc’
    • lrim(’ abc ‘) = ‘abc ‘
  • trimming other values
  • SELECT trim('Wow!', '!'); # Wow SELECT trim('WoW!', '!wW'); #o
  • combining functions
  • SELECT trim(lower('Wow!'), '!w'); # o

Splitting and concatenating text

  • substring
SELECT left('abcde', 2),
				right('abcde', 2),
				
SELECT left('abc', 10),
				length(left('abc', 10));
  • SELECT substring(string FROM start FOR length);
    • SELECT substr(string, start, length); 결과 동일
  • delimiters
    • splitting _ SELECT split_part(string, delimiter, part);
    SELECT split_part('a,bc,d', ',', 2);
    # bc
    
  • concatenating text
    SELECT concat('a', NULL, 'cc');
    # acc
    SELECT 'a' || NULL || 'cc';
    # 
    
  • SELECT concat('a', 2, 'cc'); SELECT 'a' || 2 || 'cc'; # a2cc

Strategies for multiple transformations

  • CASE WHEN
SELECT CASE WHEN category LIKE '%: %' THEN split_part(category, ': ', 1)
						WHEN category LIKE '% - %' THEN split_part(category, '- ', 1)
						ELSE split_part(category, ' | ', 1)
				END AS major_category, sum(business)
FROM naics
GROUP BY major_category;
  • recoding table
    1. create temp table with original values
    CREATE TEMP TABLE recode AS
    SELECT DISTINCT fav_fruit AS original, 
    								fav_fruit AS standardized
    FROM fruit;
    
    1. update to create standardized values
    UPDATE recode
    SET standardized=trim(lower(original));
    
    UPDATE recode
    SET standardized='banana'
    WHERE standardized LIKE '%nn%';
    
    UPDATE recode
    SET standardized=trim(standardized, 's'));
    
    1. join original data to standardized data
    # original only
    SELECT fav_fruit, count(*)
    FROM fruit
    GROUP BY fav_fruit;
    
  • # with recoded values SELECT standardized, count(*) FROM fruit LEFT JOIN recode ON fav_fruit=original GROUP BY standardized;

Working with dates and timestamps

Date/time types and formats

  • date _ YYYY-MM-DD
  • timestamp _ YYYY-MM-DD HH:MM:SS
  • standards
    • timestamp with timezone
    • YYYY-MM-DD HH:MM:SS+HH
  • YYYY-MM-DD HH:MM:SS
  • comparisons _ >, <, =
    • current _ now()
  • subtraction, addition
SELECT now() - '2018-01-01'

SELECT '2010-01-01'::date + 1;
# 2010-01-02

SELECT '2018-12-10'::date + '1 year'::interval;
# 2019-12-10 00:00:00

SELECT '2018-12-10'::date + '1 year 2 days 3 minutes'::interval;
# 2019-12-12 00:00:03

Date/time components and aggregation

  • common data/time fields
    • century
    • decade
    • year, month, day
    • hour, minute, second
    • week
    • dow: day of week
  • extracting fields
date_part('field', timestamp)
EXTRACT(FIELD FROM timestamp)
SELECT date_part('month', now()),
			EXTRACT(MONTH FROM now());
  • extract to summarize by field
# individual sales
SELECT *
FROM sales
WHERE date >= '2010-01-01'
			AND date < '2017-01-01';
# by month
SELECT date_part('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
  • truncating dates _ date_trunc(’field’, timestamp)
SELECT date_trunc('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;

Aggregating with date/time series

  • generate series _ SELECT generate_series(from, to, interval)
    • from the beginning
    SELECT generate_series('2018-01-31', '2018-12-31', '1 month'::interval)
    
    • aggregation with series
    WITH hour_series AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 14:00:00',
    														'1 hour'::interval) AS hours)
    
    SELECT hours, count(data)
    FROM hour_series
    LEFT JOIN sales
    ON hours=date_Truc('hour', date)
    GROUP BY hours
    GROUP BY hours
    
    • aggregation with bins
    WITH bins AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 15:00:00',
    														'3 hour'::interval) AS lower,
    						generate_series('2018-04-23 12:00:00',
    														'2018-04-23 18:00:00',
    														'3 hour'::interval) AS upper)
    
    SELECT lower, upper, count(date)
    FROM bins
    LEFT JOIN sales
    ON date >= lower
    	AND date < upper
    GROUP BY lower, upper
    ORDER BY lower
    
  • SELECT generate_series('2018-02-01', '2019-01-01', '1 month'::interval) - '1 day'::interval;

Time between events

  • lead and log
SELECT date,
				lag(date) OVER (ORDER BY date),
				lead(date) OVER (ORDER BY date)
FROM sales;
  • time between events
SELECT date,
			date - lag(data) OVER (ORDER BY date) AS gap
FROM sales 

# average time
SELECT avg(gap)
FROM (SELECT date - lag(date) OVER (ORDER BY date) AS gap
	FROM sales) AS gaps;
  • change in a time series
SELECT date, amount, lag(amount) OVER (ORDER BY date),
							amount - lag(amount) OVER (ORDER BY date) AS chane

Character data types and common issues

  • 길이에 따른 character type
    • character(n) or char(n) _ 길이 n으로 고정
    • character varying(n) or varchar(n) _ 최대 길이 n
    • text or varchar _ 길이 무제한
  • text data 유형
    • categorical: 값을 가지는 짧은 길이의 string이 반복
    • unstructed text: unique 값의 더 긴 string
      • 분석 _ text에서 기능 추출하거나, 특정 특성 존재 여부를 나타내는 변수 생성
  • categorical variables에 대한 grouping & counting
  • SELECT category, count(*) FROM product GROUP BY category;
  • Order
    • Alphabetical Order _ ‘ ‘ < ‘A’ < ‘B’ < ‘a’ < ‘b’
  • SELECT category, count(*) FROM product GROUP BY category ORDER BY count DESC; # 많은 순서대로 ORDER BY category; # categorical variable 순서대로

Cases and Spaces

  • convert case
SELECT lower('aBc DeFg 7-');
  • case insensitive comparisons
SELECT *
FROM fruit
WHERE lower(fav_fruit) = 'apple';
  • case insensitive searches
SELECT *
FROM fruit
WHERE fav_fruit ILIKE '%apple%';

# LIKE 사용할 경우 문자 그대로 값에만 적용
  • trimming spaces
    • trim(’ abc ‘) (== btrim(’ abc ‘)) = ‘abc’
    • rtrim(’ abc ‘) = ‘ abc’
    • lrim(’ abc ‘) = ‘abc ‘
  • trimming other values
  • SELECT trim('Wow!', '!'); # Wow SELECT trim('WoW!', '!wW'); #o
  • combining functions
  • SELECT trim(lower('Wow!'), '!w'); # o

Splitting and concatenating text

  • substring
SELECT left('abcde', 2),
				right('abcde', 2),
				
SELECT left('abc', 10),
				length(left('abc', 10));
  • SELECT substring(string FROM start FOR length);
    • SELECT substr(string, start, length); 결과 동일
  • delimiters
    • splitting _ SELECT split_part(string, delimiter, part);
    SELECT split_part('a,bc,d', ',', 2);
    # bc
    
  • concatenating text
    SELECT concat('a', NULL, 'cc');
    # acc
    SELECT 'a' || NULL || 'cc';
    # 
    
  • SELECT concat('a', 2, 'cc'); SELECT 'a' || 2 || 'cc'; # a2cc

Strategies for multiple transformations

  • CASE WHEN
SELECT CASE WHEN category LIKE '%: %' THEN split_part(category, ': ', 1)
						WHEN category LIKE '% - %' THEN split_part(category, '- ', 1)
						ELSE split_part(category, ' | ', 1)
				END AS major_category, sum(business)
FROM naics
GROUP BY major_category;
  • recoding table
    1. create temp table with original values
    CREATE TEMP TABLE recode AS
    SELECT DISTINCT fav_fruit AS original, 
    								fav_fruit AS standardized
    FROM fruit;
    
    1. update to create standardized values
    UPDATE recode
    SET standardized=trim(lower(original));
    
    UPDATE recode
    SET standardized='banana'
    WHERE standardized LIKE '%nn%';
    
    UPDATE recode
    SET standardized=trim(standardized, 's'));
    
    1. join original data to standardized data
    # original only
    SELECT fav_fruit, count(*)
    FROM fruit
    GROUP BY fav_fruit;
    
  • # with recoded values SELECT standardized, count(*) FROM fruit LEFT JOIN recode ON fav_fruit=original GROUP BY standardized;

Working with dates and timestamps

Date/time types and formats

  • date _ YYYY-MM-DD
  • timestamp _ YYYY-MM-DD HH:MM:SS
  • standards
    • timestamp with timezone
    • YYYY-MM-DD HH:MM:SS+HH
  • YYYY-MM-DD HH:MM:SS
  • comparisons _ >, <, =
    • current _ now()
  • subtraction, addition
SELECT now() - '2018-01-01'

SELECT '2010-01-01'::date + 1;
# 2010-01-02

SELECT '2018-12-10'::date + '1 year'::interval;
# 2019-12-10 00:00:00

SELECT '2018-12-10'::date + '1 year 2 days 3 minutes'::interval;
# 2019-12-12 00:00:03

Date/time components and aggregation

  • common data/time fields
    • century
    • decade
    • year, month, day
    • hour, minute, second
    • week
    • dow: day of week
  • extracting fields
date_part('field', timestamp)
EXTRACT(FIELD FROM timestamp)
SELECT date_part('month', now()),
			EXTRACT(MONTH FROM now());
  • extract to summarize by field
# individual sales
SELECT *
FROM sales
WHERE date >= '2010-01-01'
			AND date < '2017-01-01';
# by month
SELECT date_part('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
  • truncating dates _ date_trunc(’field’, timestamp)
SELECT date_trunc('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;

Aggregating with date/time series

  • generate series _ SELECT generate_series(from, to, interval)
    • from the beginning
    SELECT generate_series('2018-01-31', '2018-12-31', '1 month'::interval)
    
    • aggregation with series
    WITH hour_series AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 14:00:00',
    														'1 hour'::interval) AS hours)
    
    SELECT hours, count(data)
    FROM hour_series
    LEFT JOIN sales
    ON hours=date_Truc('hour', date)
    GROUP BY hours
    GROUP BY hours
    
    • aggregation with bins
    WITH bins AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 15:00:00',
    														'3 hour'::interval) AS lower,
    						generate_series('2018-04-23 12:00:00',
    														'2018-04-23 18:00:00',
    														'3 hour'::interval) AS upper)
    
    SELECT lower, upper, count(date)
    FROM bins
    LEFT JOIN sales
    ON date >= lower
    	AND date < upper
    GROUP BY lower, upper
    ORDER BY lower
    
  • SELECT generate_series('2018-02-01', '2019-01-01', '1 month'::interval) - '1 day'::interval;

Time between events

  • lead and log
SELECT date,
				lag(date) OVER (ORDER BY date),
				lead(date) OVER (ORDER BY date)
FROM sales;
  • time between events
SELECT date,
			date - lag(data) OVER (ORDER BY date) AS gap
FROM sales 

# average time
SELECT avg(gap)
FROM (SELECT date - lag(date) OVER (ORDER BY date) AS gap
	FROM sales) AS gaps;
  • change in a time series
SELECT date, amount, lag(amount) OVER (ORDER BY date),
							amount - lag(amount) OVER (ORDER BY date) AS chane

Character data types and common issues

  • 길이에 따른 character type
    • character(n) or char(n) _ 길이 n으로 고정
    • character varying(n) or varchar(n) _ 최대 길이 n
    • text or varchar _ 길이 무제한
  • text data 유형
    • categorical: 값을 가지는 짧은 길이의 string이 반복
    • unstructed text: unique 값의 더 긴 string
      • 분석 _ text에서 기능 추출하거나, 특정 특성 존재 여부를 나타내는 변수 생성
  • categorical variables에 대한 grouping & counting
  • SELECT category, count(*) FROM product GROUP BY category;
  • Order
    • Alphabetical Order _ ‘ ‘ < ‘A’ < ‘B’ < ‘a’ < ‘b’
  • SELECT category, count(*) FROM product GROUP BY category ORDER BY count DESC; # 많은 순서대로 ORDER BY category; # categorical variable 순서대로

Cases and Spaces

  • convert case
SELECT lower('aBc DeFg 7-');
  • case insensitive comparisons
SELECT *
FROM fruit
WHERE lower(fav_fruit) = 'apple';
  • case insensitive searches
SELECT *
FROM fruit
WHERE fav_fruit ILIKE '%apple%';

# LIKE 사용할 경우 문자 그대로 값에만 적용
  • trimming spaces
    • trim(’ abc ‘) (== btrim(’ abc ‘)) = ‘abc’
    • rtrim(’ abc ‘) = ‘ abc’
    • lrim(’ abc ‘) = ‘abc ‘
  • trimming other values
  • SELECT trim('Wow!', '!'); # Wow SELECT trim('WoW!', '!wW'); #o
  • combining functions
  • SELECT trim(lower('Wow!'), '!w'); # o

Splitting and concatenating text

  • substring
SELECT left('abcde', 2),
				right('abcde', 2),
				
SELECT left('abc', 10),
				length(left('abc', 10));
  • SELECT substring(string FROM start FOR length);
    • SELECT substr(string, start, length); 결과 동일
  • delimiters
    • splitting _ SELECT split_part(string, delimiter, part);
    SELECT split_part('a,bc,d', ',', 2);
    # bc
    
  • concatenating text
    SELECT concat('a', NULL, 'cc');
    # acc
    SELECT 'a' || NULL || 'cc';
    # 
    
  • SELECT concat('a', 2, 'cc'); SELECT 'a' || 2 || 'cc'; # a2cc

Strategies for multiple transformations

  • CASE WHEN
SELECT CASE WHEN category LIKE '%: %' THEN split_part(category, ': ', 1)
						WHEN category LIKE '% - %' THEN split_part(category, '- ', 1)
						ELSE split_part(category, ' | ', 1)
				END AS major_category, sum(business)
FROM naics
GROUP BY major_category;
  • recoding table
    1. create temp table with original values
    CREATE TEMP TABLE recode AS
    SELECT DISTINCT fav_fruit AS original, 
    								fav_fruit AS standardized
    FROM fruit;
    
    1. update to create standardized values
    UPDATE recode
    SET standardized=trim(lower(original));
    
    UPDATE recode
    SET standardized='banana'
    WHERE standardized LIKE '%nn%';
    
    UPDATE recode
    SET standardized=trim(standardized, 's'));
    
    1. join original data to standardized data
    # original only
    SELECT fav_fruit, count(*)
    FROM fruit
    GROUP BY fav_fruit;
    
  • # with recoded values SELECT standardized, count(*) FROM fruit LEFT JOIN recode ON fav_fruit=original GROUP BY standardized;

Working with dates and timestamps

Date/time types and formats

  • date _ YYYY-MM-DD
  • timestamp _ YYYY-MM-DD HH:MM:SS
  • standards
    • timestamp with timezone
    • YYYY-MM-DD HH:MM:SS+HH
  • YYYY-MM-DD HH:MM:SS
  • comparisons _ >, <, =
    • current _ now()
  • subtraction, addition
SELECT now() - '2018-01-01'

SELECT '2010-01-01'::date + 1;
# 2010-01-02

SELECT '2018-12-10'::date + '1 year'::interval;
# 2019-12-10 00:00:00

SELECT '2018-12-10'::date + '1 year 2 days 3 minutes'::interval;
# 2019-12-12 00:00:03

Date/time components and aggregation

  • common data/time fields
    • century
    • decade
    • year, month, day
    • hour, minute, second
    • week
    • dow: day of week
  • extracting fields
date_part('field', timestamp)
EXTRACT(FIELD FROM timestamp)
SELECT date_part('month', now()),
			EXTRACT(MONTH FROM now());
  • extract to summarize by field
# individual sales
SELECT *
FROM sales
WHERE date >= '2010-01-01'
			AND date < '2017-01-01';
# by month
SELECT date_part('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
  • truncating dates _ date_trunc(’field’, timestamp)
SELECT date_trunc('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;

Aggregating with date/time series

  • generate series _ SELECT generate_series(from, to, interval)
    • from the beginning
    SELECT generate_series('2018-01-31', '2018-12-31', '1 month'::interval)
    
    • aggregation with series
    WITH hour_series AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 14:00:00',
    														'1 hour'::interval) AS hours)
    
    SELECT hours, count(data)
    FROM hour_series
    LEFT JOIN sales
    ON hours=date_Truc('hour', date)
    GROUP BY hours
    GROUP BY hours
    
    • aggregation with bins
    WITH bins AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 15:00:00',
    														'3 hour'::interval) AS lower,
    						generate_series('2018-04-23 12:00:00',
    														'2018-04-23 18:00:00',
    														'3 hour'::interval) AS upper)
    
    SELECT lower, upper, count(date)
    FROM bins
    LEFT JOIN sales
    ON date >= lower
    	AND date < upper
    GROUP BY lower, upper
    ORDER BY lower
    
  • SELECT generate_series('2018-02-01', '2019-01-01', '1 month'::interval) - '1 day'::interval;

Time between events

  • lead and log
SELECT date,
				lag(date) OVER (ORDER BY date),
				lead(date) OVER (ORDER BY date)
FROM sales;
  • time between events
SELECT date,
			date - lag(data) OVER (ORDER BY date) AS gap
FROM sales 

# average time
SELECT avg(gap)
FROM (SELECT date - lag(date) OVER (ORDER BY date) AS gap
	FROM sales) AS gaps;
  • change in a time series
SELECT date, amount, lag(amount) OVER (ORDER BY date),
							amount - lag(amount) OVER (ORDER BY date) AS chane

Character data types and common issues

  • 길이에 따른 character type
    • character(n) or char(n) _ 길이 n으로 고정
    • character varying(n) or varchar(n) _ 최대 길이 n
    • text or varchar _ 길이 무제한
  • text data 유형
    • categorical: 값을 가지는 짧은 길이의 string이 반복
    • unstructed text: unique 값의 더 긴 string
      • 분석 _ text에서 기능 추출하거나, 특정 특성 존재 여부를 나타내는 변수 생성
  • categorical variables에 대한 grouping & counting
  • SELECT category, count(*) FROM product GROUP BY category;
  • Order
    • Alphabetical Order _ ‘ ‘ < ‘A’ < ‘B’ < ‘a’ < ‘b’
  • SELECT category, count(*) FROM product GROUP BY category ORDER BY count DESC; # 많은 순서대로 ORDER BY category; # categorical variable 순서대로

Cases and Spaces

  • convert case
SELECT lower('aBc DeFg 7-');
  • case insensitive comparisons
SELECT *
FROM fruit
WHERE lower(fav_fruit) = 'apple';
  • case insensitive searches
SELECT *
FROM fruit
WHERE fav_fruit ILIKE '%apple%';

# LIKE 사용할 경우 문자 그대로 값에만 적용
  • trimming spaces
    • trim(’ abc ‘) (== btrim(’ abc ‘)) = ‘abc’
    • rtrim(’ abc ‘) = ‘ abc’
    • lrim(’ abc ‘) = ‘abc ‘
  • trimming other values
  • SELECT trim('Wow!', '!'); # Wow SELECT trim('WoW!', '!wW'); #o
  • combining functions
  • SELECT trim(lower('Wow!'), '!w'); # o

Splitting and concatenating text

  • substring
SELECT left('abcde', 2),
				right('abcde', 2),
				
SELECT left('abc', 10),
				length(left('abc', 10));
  • SELECT substring(string FROM start FOR length);
    • SELECT substr(string, start, length); 결과 동일
  • delimiters
    • splitting _ SELECT split_part(string, delimiter, part);
    SELECT split_part('a,bc,d', ',', 2);
    # bc
    
  • concatenating text
    SELECT concat('a', NULL, 'cc');
    # acc
    SELECT 'a' || NULL || 'cc';
    # 
    
  • SELECT concat('a', 2, 'cc'); SELECT 'a' || 2 || 'cc'; # a2cc

Strategies for multiple transformations

  • CASE WHEN
SELECT CASE WHEN category LIKE '%: %' THEN split_part(category, ': ', 1)
						WHEN category LIKE '% - %' THEN split_part(category, '- ', 1)
						ELSE split_part(category, ' | ', 1)
				END AS major_category, sum(business)
FROM naics
GROUP BY major_category;
  • recoding table
    1. create temp table with original values
    CREATE TEMP TABLE recode AS
    SELECT DISTINCT fav_fruit AS original, 
    								fav_fruit AS standardized
    FROM fruit;
    
    1. update to create standardized values
    UPDATE recode
    SET standardized=trim(lower(original));
    
    UPDATE recode
    SET standardized='banana'
    WHERE standardized LIKE '%nn%';
    
    UPDATE recode
    SET standardized=trim(standardized, 's'));
    
    1. join original data to standardized data
    # original only
    SELECT fav_fruit, count(*)
    FROM fruit
    GROUP BY fav_fruit;
    
  • # with recoded values SELECT standardized, count(*) FROM fruit LEFT JOIN recode ON fav_fruit=original GROUP BY standardized;

Working with dates and timestamps

Date/time types and formats

  • date _ YYYY-MM-DD
  • timestamp _ YYYY-MM-DD HH:MM:SS
  • standards
    • YYYY-MM-DD HH:MM:SS
    • timestamp with timezone _ YYYY-MM-DD HH:MM:SS+HH
  • comparisons _ >, <, =
    • current timestamp _ now()
  • subtraction, addition
SELECT now() - '2018-01-01';

SELECT '2010-01-01'::date + 1;
-- 2010-01-02

SELECT '2018-12-10'::date + '1 year'::interval;
-- 2019-12-10 00:00:00

SELECT '2018-12-10'::date + '1 year 2 days 3 minutes'::interval;
-- 2019-12-12 00:03:00
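
The same arithmetic can be double-checked outside the database. This is only a rough Python analogue with the standard `datetime` module, not how PostgreSQL evaluates intervals: `timedelta` has no year unit, so the year is bumped with `replace()`, which is an assumption that holds for this date but would fail for 29 February.

```python
from datetime import date, datetime, timedelta

# date + integer in PostgreSQL means "add that many days".
next_day = date(2010, 1, 1) + timedelta(days=1)

# '1 year 2 days 3 minutes': bump the year, then add the days and minutes.
start = datetime(2018, 12, 10)
shifted = start.replace(year=start.year + 1) + timedelta(days=2, minutes=3)

print(next_day)  # 2010-01-02
print(shifted)   # 2019-12-12 00:03:00
```

Adding an interval to a date produces a timestamp, which is why the second result carries a time part.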

Date/time components and aggregation

  • common date/time fields
    • century
    • decade
    • year, month, day
    • hour, minute, second
    • week
    • dow: day of week
  • extracting fields
date_part('field', timestamp)
EXTRACT(FIELD FROM timestamp)
SELECT date_part('month', now()),
			EXTRACT(MONTH FROM now());
  • extract to summarize by field
-- individual sales
SELECT *
FROM sales
WHERE date >= '2010-01-01'
      AND date < '2017-01-01';
-- by month
SELECT date_part('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
  • truncating dates _ date_trunc('field', timestamp)
SELECT date_trunc('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
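
The two monthly queries above group differently: `date_part('month', ...)` keeps only the month number, so January 2016 and January 2017 land in the same group, while `date_trunc('month', ...)` keeps the year. A small Python sketch of that difference, using made-up `(date, amt)` rows and plain dictionaries in place of GROUP BY:

```python
from collections import defaultdict
from datetime import date

# Hypothetical (date, amt) rows spanning two years.
sales = [
    (date(2016, 1, 5), 10),
    (date(2016, 1, 20), 5),
    (date(2017, 1, 3), 7),
    (date(2017, 2, 14), 4),
]

by_part = defaultdict(int)   # like date_part('month', date): month number only
by_trunc = defaultdict(int)  # like date_trunc('month', date): first day of that month

for day, amt in sales:
    by_part[day.month] += amt
    by_trunc[day.replace(day=1)] += amt

print(dict(by_part))
print(dict(by_trunc))
```

Use the `date_part` form for seasonal questions (all Januaries together) and the `date_trunc` form for a month-by-month timeline.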

Aggregating with date/time series

  • generate series _ SELECT generate_series(from, to, interval)
    SELECT generate_series('2018-01-31', '2018-12-31', '1 month'::interval);
    -- month ends are not preserved: 2018-01-31, 2018-02-28, 2018-03-28, ...
    
    • from the beginning _ generate the first day of each following month, then subtract one day
    SELECT generate_series('2018-02-01', '2019-01-01', '1 month'::interval) - '1 day'::interval;
    
    • aggregation with series
    WITH hour_series AS (
        SELECT generate_series('2018-04-23 09:00:00',
                               '2018-04-23 14:00:00',
                               '1 hour'::interval) AS hours)
    
    SELECT hours, count(date)
    FROM hour_series
    LEFT JOIN sales
    ON hours=date_trunc('hour', date)
    GROUP BY hours
    ORDER BY hours;
    
    • aggregation with bins
    WITH bins AS (
        SELECT generate_series('2018-04-23 09:00:00',
                               '2018-04-23 15:00:00',
                               '3 hour'::interval) AS lower,
               generate_series('2018-04-23 12:00:00',
                               '2018-04-23 18:00:00',
                               '3 hour'::interval) AS upper)
    
    SELECT lower, upper, count(date)
    FROM bins
    LEFT JOIN sales
    ON date >= lower
        AND date < upper
    GROUP BY lower, upper
    ORDER BY lower;
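
Two ideas from this section can be sketched in plain Python: joining a complete hourly series to the sales so that empty hours still appear with a count of 0, and building month-end dates from first-of-month dates minus one day. The sale times and the helper `month_starts()` are hypothetical, and the code only mirrors the logic of the queries rather than running them.

```python
from collections import Counter
from datetime import date, datetime, timedelta

# 1) Hourly series 09:00..14:00 "left joined" to sales, keeping empty hours.
hours = [datetime(2018, 4, 23, 9) + timedelta(hours=i) for i in range(6)]
sale_times = [datetime(2018, 4, 23, 9, 15), datetime(2018, 4, 23, 9, 40),
              datetime(2018, 4, 23, 13, 5)]  # made-up sales
per_hour = Counter(t.replace(minute=0, second=0, microsecond=0) for t in sale_times)
counts = [(h.hour, per_hour.get(h, 0)) for h in hours]  # 0 where no sale matches

# 2) Month ends: take the first day of each following month, subtract one day.
def month_starts(first, count):
    """Yield `count` consecutive first-of-month dates starting at `first`."""
    year, month = first.year, first.month
    for _ in range(count):
        yield date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

month_ends = [d - timedelta(days=1) for d in month_starts(date(2018, 2, 1), 12)]

print(counts)
print(month_ends[0], month_ends[1], month_ends[-1])
```

Grouping the sales alone would silently drop the hours with no sales, which is why the series goes on the left side of the join.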

Time between events

  • lag and lead
SELECT date,
       lag(date) OVER (ORDER BY date),
       lead(date) OVER (ORDER BY date)
FROM sales;
  • time between events
SELECT date,
       date - lag(date) OVER (ORDER BY date) AS gap
FROM sales;

-- average time
SELECT avg(gap)
FROM (SELECT date - lag(date) OVER (ORDER BY date) AS gap
      FROM sales) AS gaps;
  • change in a time series
SELECT date, amount, lag(amount) OVER (ORDER BY date),
       amount - lag(amount) OVER (ORDER BY date) AS change
FROM sales;
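
The window-function pattern can be tried without a PostgreSQL server. This sketch assumes Python's built-in sqlite3 module with SQLite 3.25 or newer (needed for window functions). SQLite stores dates as text, so `julianday()` is used to get the gap in days where PostgreSQL would subtract the dates directly. The three `sales` rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2018-04-01", 10), ("2018-04-03", 25), ("2018-04-10", 15)],  # made-up rows
)

# lag() looks one row back in date order; the first row has no predecessor,
# so its gap and change are NULL (None in Python).
rows = conn.execute("""
    SELECT date,
           julianday(date) - julianday(lag(date) OVER (ORDER BY date)) AS gap_days,
           amount - lag(amount) OVER (ORDER BY date) AS change
    FROM sales
    ORDER BY date
""").fetchall()

# Average gap, skipping the NULL from the first row just as avg() does.
gaps = [gap for _, gap, _ in rows if gap is not None]
average_gap = sum(gaps) / len(gaps)

print(rows)
print(average_gap)
```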


SELECT '2010-01-01'::date + 1;
# 2010-01-02

SELECT '2018-12-10'::date + '1 year'::interval;
# 2019-12-10 00:00:00

SELECT '2018-12-10'::date + '1 year 2 days 3 minutes'::interval;
# 2019-12-12 00:00:03

Date/time components and aggregation

  • common data/time fields
    • century
    • decade
    • year, month, day
    • hour, minute, second
    • week
    • dow: day of week
  • extracting fields
date_part('field', timestamp)
EXTRACT(FIELD FROM timestamp)
SELECT date_part('month', now()),
			EXTRACT(MONTH FROM now());
  • extract to summarize by field
# individual sales
SELECT *
FROM sales
WHERE date >= '2010-01-01'
			AND date < '2017-01-01';
# by month
SELECT date_part('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
  • truncating dates _ date_trunc(’field’, timestamp)
SELECT date_trunc('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;

Aggregating with date/time series

  • generate series _ SELECT generate_series(from, to, interval)
    • from the beginning
    SELECT generate_series('2018-01-31', '2018-12-31', '1 month'::interval)
    
    • aggregation with series
    WITH hour_series AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 14:00:00',
    														'1 hour'::interval) AS hours)
    
    SELECT hours, count(data)
    FROM hour_series
    LEFT JOIN sales
    ON hours=date_Truc('hour', date)
    GROUP BY hours
    GROUP BY hours
    
    • aggregation with bins
    WITH bins AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 15:00:00',
    														'3 hour'::interval) AS lower,
    						generate_series('2018-04-23 12:00:00',
    														'2018-04-23 18:00:00',
    														'3 hour'::interval) AS upper)
    
    SELECT lower, upper, count(date)
    FROM bins
    LEFT JOIN sales
    ON date >= lower
    	AND date < upper
    GROUP BY lower, upper
    ORDER BY lower
    
  • SELECT generate_series('2018-02-01', '2019-01-01', '1 month'::interval) - '1 day'::interval;

Time between events

  • lead and log
SELECT date,
				lag(date) OVER (ORDER BY date),
				lead(date) OVER (ORDER BY date)
FROM sales;
  • time between events
SELECT date,
			date - lag(data) OVER (ORDER BY date) AS gap
FROM sales 

# average time
SELECT avg(gap)
FROM (SELECT date - lag(date) OVER (ORDER BY date) AS gap
	FROM sales) AS gaps;
  • change in a time series
SELECT date, amount, lag(amount) OVER (ORDER BY date),
							amount - lag(amount) OVER (ORDER BY date) AS chane

xploExploring categorical data and unstructed text

Character data types and common issues

  • 길이에 따른 character type
    • character(n) or char(n) _ 길이 n으로 고정
    • character varying(n) or varchar(n) _ 최대 길이 n
    • text or varchar _ 길이 무제한
  • text data 유형
    • categorical: 값을 가지는 짧은 길이의 string이 반복
    • unstructed text: unique 값의 더 긴 string
      • 분석 _ text에서 기능 추출하거나, 특정 특성 존재 여부를 나타내는 변수 생성
  • categorical variables에 대한 grouping & counting
  • SELECT category, count(*) FROM product GROUP BY category;
  • Order
    • Alphabetical Order _ ‘ ‘ < ‘A’ < ‘B’ < ‘a’ < ‘b’
  • SELECT category, count(*) FROM product GROUP BY category ORDER BY count DESC; # 많은 순서대로 ORDER BY category; # categorical variable 순서대로

Cases and Spaces

  • convert case
SELECT lower('aBc DeFg 7-');
  • case insensitive comparisons
SELECT *
FROM fruit
WHERE lower(fav_fruit) = 'apple';
  • case insensitive searches
SELECT *
FROM fruit
WHERE fav_fruit ILIKE '%apple%';

# LIKE 사용할 경우 문자 그대로 값에만 적용
  • trimming spaces
    • trim(’ abc ‘) (== btrim(’ abc ‘)) = ‘abc’
    • rtrim(’ abc ‘) = ‘ abc’
    • lrim(’ abc ‘) = ‘abc ‘
  • trimming other values
  • SELECT trim('Wow!', '!'); # Wow SELECT trim('WoW!', '!wW'); #o
  • combining functions
  • SELECT trim(lower('Wow!'), '!w'); # o

Splitting and concatenating text

  • substring
SELECT left('abcde', 2),
				right('abcde', 2),
				
SELECT left('abc', 10),
				length(left('abc', 10));
  • SELECT substring(string FROM start FOR length);
    • SELECT substr(string, start, length); 결과 동일
  • delimiters
    • splitting _ SELECT split_part(string, delimiter, part);
    SELECT split_part('a,bc,d', ',', 2);
    # bc
    
  • concatenating text
    SELECT concat('a', NULL, 'cc');
    # acc
    SELECT 'a' || NULL || 'cc';
    # 
    
  • SELECT concat('a', 2, 'cc'); SELECT 'a' || 2 || 'cc'; # a2cc

Strategies for multiple transformations

  • CASE WHEN
SELECT CASE WHEN category LIKE '%: %' THEN split_part(category, ': ', 1)
						WHEN category LIKE '% - %' THEN split_part(category, '- ', 1)
						ELSE split_part(category, ' | ', 1)
				END AS major_category, sum(business)
FROM naics
GROUP BY major_category;
  • recoding table
    1. create temp table with original values
    CREATE TEMP TABLE recode AS
    SELECT DISTINCT fav_fruit AS original, 
    								fav_fruit AS standardized
    FROM fruit;
    
    1. update to create standardized values
    UPDATE recode
    SET standardized=trim(lower(original));
    
    UPDATE recode
    SET standardized='banana'
    WHERE standardized LIKE '%nn%';
    
    UPDATE recode
    SET standardized=trim(standardized, 's'));
    
    1. join original data to standardized data
    # original only
    SELECT fav_fruit, count(*)
    FROM fruit
    GROUP BY fav_fruit;
    
  • # with recoded values SELECT standardized, count(*) FROM fruit LEFT JOIN recode ON fav_fruit=original GROUP BY standardized;

Working with dates and timestamps

Date/time types and formats

  • date _ YYYY-MM-DD
  • timestamp _ YYYY-MM-DD HH:MM:SS
  • standards
    • timestamp with timezone
    • YYYY-MM-DD HH:MM:SS+HH
  • YYYY-MM-DD HH:MM:SS
  • comparisons _ >, <, =
    • current _ now()
  • subtraction, addition
SELECT now() - '2018-01-01'

SELECT '2010-01-01'::date + 1;
# 2010-01-02

SELECT '2018-12-10'::date + '1 year'::interval;
# 2019-12-10 00:00:00

SELECT '2018-12-10'::date + '1 year 2 days 3 minutes'::interval;
# 2019-12-12 00:00:03

Date/time components and aggregation

  • common data/time fields
    • century
    • decade
    • year, month, day
    • hour, minute, second
    • week
    • dow: day of week
  • extracting fields
date_part('field', timestamp)
EXTRACT(FIELD FROM timestamp)
SELECT date_part('month', now()),
			EXTRACT(MONTH FROM now());
  • extract to summarize by field
# individual sales
SELECT *
FROM sales
WHERE date >= '2010-01-01'
			AND date < '2017-01-01';
# by month
SELECT date_part('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
  • truncating dates _ date_trunc(’field’, timestamp)
SELECT date_trunc('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;

Aggregating with date/time series

  • generate series _ SELECT generate_series(from, to, interval)
    • from the beginning
    SELECT generate_series('2018-01-31', '2018-12-31', '1 month'::interval)
    
    • aggregation with series
    WITH hour_series AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 14:00:00',
    														'1 hour'::interval) AS hours)
    
    SELECT hours, count(data)
    FROM hour_series
    LEFT JOIN sales
    ON hours=date_Truc('hour', date)
    GROUP BY hours
    GROUP BY hours
    
    • aggregation with bins
    WITH bins AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 15:00:00',
    														'3 hour'::interval) AS lower,
    						generate_series('2018-04-23 12:00:00',
    														'2018-04-23 18:00:00',
    														'3 hour'::interval) AS upper)
    
    SELECT lower, upper, count(date)
    FROM bins
    LEFT JOIN sales
    ON date >= lower
    	AND date < upper
    GROUP BY lower, upper
    ORDER BY lower
    
  • SELECT generate_series('2018-02-01', '2019-01-01', '1 month'::interval) - '1 day'::interval;

Time between events

  • lead and log
SELECT date,
				lag(date) OVER (ORDER BY date),
				lead(date) OVER (ORDER BY date)
FROM sales;
  • time between events
SELECT date,
			date - lag(data) OVER (ORDER BY date) AS gap
FROM sales 

# average time
SELECT avg(gap)
FROM (SELECT date - lag(date) OVER (ORDER BY date) AS gap
	FROM sales) AS gaps;
  • change in a time series
SELECT date, amount, lag(amount) OVER (ORDER BY date),
							amount - lag(amount) OVER (ORDER BY date) AS chane

ring categorical data and unstructed text

Character data types and common issues

  • 길이에 따른 character type
    • character(n) or char(n) _ 길이 n으로 고정
    • character varying(n) or varchar(n) _ 최대 길이 n
    • text or varchar _ 길이 무제한
  • text data 유형
    • categorical: 값을 가지는 짧은 길이의 string이 반복
    • unstructed text: unique 값의 더 긴 string
      • 분석 _ text에서 기능 추출하거나, 특정 특성 존재 여부를 나타내는 변수 생성
  • categorical variables에 대한 grouping & counting
  • SELECT category, count(*) FROM product GROUP BY category;
  • Order
    • Alphabetical Order _ ‘ ‘ < ‘A’ < ‘B’ < ‘a’ < ‘b’
  • SELECT category, count(*) FROM product GROUP BY category ORDER BY count DESC; # 많은 순서대로 ORDER BY category; # categorical variable 순서대로

Cases and Spaces

  • convert case
SELECT lower('aBc DeFg 7-');
  • case insensitive comparisons
SELECT *
FROM fruit
WHERE lower(fav_fruit) = 'apple';
  • case insensitive searches
SELECT *
FROM fruit
WHERE fav_fruit ILIKE '%apple%';

# LIKE 사용할 경우 문자 그대로 값에만 적용
  • trimming spaces
    • trim(’ abc ‘) (== btrim(’ abc ‘)) = ‘abc’
    • rtrim(’ abc ‘) = ‘ abc’
    • lrim(’ abc ‘) = ‘abc ‘
  • trimming other values
  • SELECT trim('Wow!', '!'); # Wow SELECT trim('WoW!', '!wW'); #o
  • combining functions
  • SELECT trim(lower('Wow!'), '!w'); # o

Splitting and concatenating text

  • substring
SELECT left('abcde', 2),
				right('abcde', 2),
				
SELECT left('abc', 10),
				length(left('abc', 10));
  • SELECT substring(string FROM start FOR length);
    • SELECT substr(string, start, length); 결과 동일
  • delimiters
    • splitting _ SELECT split_part(string, delimiter, part);
    SELECT split_part('a,bc,d', ',', 2);
    # bc
    
  • concatenating text
    SELECT concat('a', NULL, 'cc');
    # acc
    SELECT 'a' || NULL || 'cc';
    # 
    
  • SELECT concat('a', 2, 'cc'); SELECT 'a' || 2 || 'cc'; # a2cc

Strategies for multiple transformations

  • CASE WHEN
SELECT CASE WHEN category LIKE '%: %' THEN split_part(category, ': ', 1)
						WHEN category LIKE '% - %' THEN split_part(category, '- ', 1)
						ELSE split_part(category, ' | ', 1)
				END AS major_category, sum(business)
FROM naics
GROUP BY major_category;
  • recoding table
    1. create temp table with original values
    CREATE TEMP TABLE recode AS
    SELECT DISTINCT fav_fruit AS original, 
    								fav_fruit AS standardized
    FROM fruit;
    
    1. update to create standardized values
    UPDATE recode
    SET standardized=trim(lower(original));
    
    UPDATE recode
    SET standardized='banana'
    WHERE standardized LIKE '%nn%';
    
    UPDATE recode
    SET standardized=trim(standardized, 's'));
    
    1. join original data to standardized data
    # original only
    SELECT fav_fruit, count(*)
    FROM fruit
    GROUP BY fav_fruit;
    
  • # with recoded values SELECT standardized, count(*) FROM fruit LEFT JOIN recode ON fav_fruit=original GROUP BY standardized;

Working with dates and timestamps

Date/time types and formats

  • date _ YYYY-MM-DD
  • timestamp _ YYYY-MM-DD HH:MM:SS
  • standards
    • timestamp with timezone
    • YYYY-MM-DD HH:MM:SS+HH
  • YYYY-MM-DD HH:MM:SS
  • comparisons _ >, <, =
    • current _ now()
  • subtraction, addition
SELECT now() - '2018-01-01'

SELECT '2010-01-01'::date + 1;
# 2010-01-02

SELECT '2018-12-10'::date + '1 year'::interval;
# 2019-12-10 00:00:00

SELECT '2018-12-10'::date + '1 year 2 days 3 minutes'::interval;
# 2019-12-12 00:00:03

Date/time components and aggregation

  • common data/time fields
    • century
    • decade
    • year, month, day
    • hour, minute, second
    • week
    • dow: day of week
  • extracting fields
date_part('field', timestamp)
EXTRACT(FIELD FROM timestamp)
SELECT date_part('month', now()),
			EXTRACT(MONTH FROM now());
  • extract to summarize by field
# individual sales
SELECT *
FROM sales
WHERE date >= '2010-01-01'
			AND date < '2017-01-01';
# by month
SELECT date_part('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
  • truncating dates _ date_trunc(’field’, timestamp)
SELECT date_trunc('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;

Aggregating with date/time series

  • generate series _ SELECT generate_series(from, to, interval)
    • from the beginning
    SELECT generate_series('2018-01-31', '2018-12-31', '1 month'::interval)
    
    • aggregation with series
    WITH hour_series AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 14:00:00',
    														'1 hour'::interval) AS hours)
    
    SELECT hours, count(data)
    FROM hour_series
    LEFT JOIN sales
    ON hours=date_Truc('hour', date)
    GROUP BY hours
    GROUP BY hours
    
    • aggregation with bins
    WITH bins AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 15:00:00',
    														'3 hour'::interval) AS lower,
    						generate_series('2018-04-23 12:00:00',
    														'2018-04-23 18:00:00',
    														'3 hour'::interval) AS upper)
    
    SELECT lower, upper, count(date)
    FROM bins
    LEFT JOIN sales
    ON date >= lower
    	AND date < upper
    GROUP BY lower, upper
    ORDER BY lower
    
  • SELECT generate_series('2018-02-01', '2019-01-01', '1 month'::interval) - '1 day'::interval;

Time between events

  • lead and log
SELECT date,
				lag(date) OVER (ORDER BY date),
				lead(date) OVER (ORDER BY date)
FROM sales;
  • time between events
SELECT date,
			date - lag(data) OVER (ORDER BY date) AS gap
FROM sales 

# average time
SELECT avg(gap)
FROM (SELECT date - lag(date) OVER (ORDER BY date) AS gap
	FROM sales) AS gaps;
  • change in a time series
SELECT date, amount, lag(amount) OVER (ORDER BY date),
							amount - lag(amount) OVER (ORDER BY date) AS chane

Exploring categorical data and unstructured text

Character data types and common issues

  • Character types by length
    • character(n) or char(n) _ fixed length n (shorter values are padded with trailing spaces)
    • character varying(n) or varchar(n) _ maximum length n
    • text or varchar _ unlimited length
  • Types of text data
    • categorical: short strings whose values repeat across rows
    • unstructured text: longer strings whose values are mostly unique
      • analysis _ extract features from the text, or create variables that indicate whether a particular characteristic is present
  • Grouping & counting categorical variables
SELECT category, count(*)
FROM product
GROUP BY category;
  • Order
    • Alphabetical order _ ' ' < 'A' < 'B' < 'a' < 'b'
SELECT category, count(*)
FROM product
GROUP BY category
ORDER BY count DESC;  -- most frequent first

SELECT category, count(*)
FROM product
GROUP BY category
ORDER BY category;    -- in order of the category values
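
The grouping and ordering above can be tried on a small made-up table. This is only a sketch: the table name `product_demo` and its rows are assumptions for illustration, not data from the course. It shows that values differing only in case or leading spaces are counted as separate categories.

```sql
-- Hypothetical sample data (not from the lecture).
CREATE TEMP TABLE product_demo (category text);
INSERT INTO product_demo VALUES
  ('fruit'), ('Fruit'), ('dairy'), ('fruit'), (' dairy'), ('dairy'), ('fruit');

-- Most frequent category first; ties are broken by the category value.
-- 'fruit', 'Fruit', 'dairy' and ' dairy' are four distinct groups.
SELECT category, count(*)
FROM product_demo
GROUP BY category
ORDER BY count DESC, category;
```

How the tied rows sort relative to each other depends on the database collation, so the order of 'Fruit' and ' dairy' may differ between servers.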

Cases and Spaces

  • convert case
SELECT lower('aBc DeFg 7-');
  • case insensitive comparisons
SELECT *
FROM fruit
WHERE lower(fav_fruit) = 'apple';
  • case insensitive searches
SELECT *
FROM fruit
WHERE fav_fruit ILIKE '%apple%';

-- LIKE, by contrast, is case sensitive and matches the pattern exactly as written
  • trimming spaces
    • trim(' abc ') (== btrim(' abc ')) = 'abc'
    • rtrim(' abc ') = ' abc'
    • ltrim(' abc ') = 'abc '
  • trimming other values
SELECT trim('Wow!', '!');    -- Wow
SELECT trim('WoW!', '!wW');  -- o
  • combining functions
SELECT trim(lower('Wow!'), '!w');  -- o
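
As a rough illustration of the difference between an exact comparison on a cleaned value and an ILIKE search, the sketch below uses a made-up table `fruit_demo` (the name and rows are assumptions, not course data).

```sql
-- Hypothetical sample data (not from the lecture).
CREATE TEMP TABLE fruit_demo (fav_fruit text);
INSERT INTO fruit_demo VALUES
  ('Apple'), (' apple'), ('APPLES '), ('banana'), ('pineapple');

-- Exact match after lowercasing and trimming spaces:
-- matches 'Apple' and ' apple' only.
SELECT fav_fruit
FROM fruit_demo
WHERE trim(lower(fav_fruit)) = 'apple';

-- Case-insensitive substring search:
-- matches 'Apple', ' apple', 'APPLES ' and also 'pineapple'.
SELECT fav_fruit
FROM fruit_demo
WHERE fav_fruit ILIKE '%apple%';
```

The ILIKE version is more forgiving about plurals and stray spaces, but it can also pick up unintended values such as 'pineapple'.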

Splitting and concatenating text

  • substring
SELECT left('abcde', 2),   -- ab
       right('abcde', 2);  -- de

SELECT left('abc', 10),          -- abc
       length(left('abc', 10));  -- 3
  • SELECT substring(string FROM start FOR length);
    • SELECT substr(string, start, length); gives the same result
  • delimiters
    • splitting _ SELECT split_part(string, delimiter, part);
    SELECT split_part('a,bc,d', ',', 2);
    -- bc
    
  • concatenating text
    SELECT concat('a', NULL, 'cc');
    -- acc (concat skips NULL arguments)
    SELECT 'a' || NULL || 'cc';
    -- NULL (|| returns NULL if any operand is NULL)
    
    SELECT concat('a', 2, 'cc');
    SELECT 'a' || 2 || 'cc';
    -- a2cc (both)
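
A small sketch that puts splitting and both concatenation styles side by side. The table `category_demo`, its columns, and its values are made up for illustration.

```sql
-- Hypothetical values in the form 'major: minor' (not from the lecture).
CREATE TEMP TABLE category_demo (category text, note text);
INSERT INTO category_demo VALUES
  ('Food: Fruit', 'fresh'),
  ('Food: Dairy', NULL);

SELECT split_part(category, ': ', 1) AS major,        -- text before the delimiter
       split_part(category, ': ', 2) AS minor,        -- text after the delimiter
       left(category, 4)             AS first_four,   -- first four characters
       concat(category, ' / ', note) AS with_concat,  -- a NULL note is skipped
       category || ' / ' || note     AS with_pipes    -- NULL when note is NULL
FROM category_demo;
```

For the row with a NULL note, `with_concat` is 'Food: Dairy / ' while `with_pipes` is NULL, which is the practical reason to prefer `concat` when some columns may be missing.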

Strategies for multiple transformations

  • CASE WHEN
SELECT CASE WHEN category LIKE '%: %' THEN split_part(category, ': ', 1)
            WHEN category LIKE '% - %' THEN split_part(category, ' - ', 1)
            ELSE split_part(category, ' | ', 1)
       END AS major_category, sum(business)
FROM naics
GROUP BY major_category;
  • recoding table
    1. create temp table with original values
    CREATE TEMP TABLE recode AS
    SELECT DISTINCT fav_fruit AS original,
                    fav_fruit AS standardized
    FROM fruit;
    
    2. update to create standardized values
    UPDATE recode
    SET standardized=trim(lower(original));
    
    UPDATE recode
    SET standardized='banana'
    WHERE standardized LIKE '%nn%';
    
    -- rtrim so that only a trailing plural 's' is stripped
    UPDATE recode
    SET standardized=rtrim(standardized, 's');
    
    3. join original data to standardized data
    -- original only
    SELECT fav_fruit, count(*)
    FROM fruit
    GROUP BY fav_fruit;
    
    -- with recoded values
    SELECT standardized, count(*)
    FROM fruit
    LEFT JOIN recode
    ON fav_fruit=original
    GROUP BY standardized;
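
The three recoding steps can be run end to end on a tiny made-up survey. The table names `fruit_survey` and `recode_demo` and the sample answers are assumptions for illustration only.

```sql
-- Hypothetical messy survey answers (not from the lecture).
CREATE TEMP TABLE fruit_survey (fav_fruit text);
INSERT INTO fruit_survey VALUES
  ('apple'), (' Apple'), ('APPLES'), ('banana'), ('bannana'), ('Bananas ');

-- Step 1: one row per distinct original value.
CREATE TEMP TABLE recode_demo AS
SELECT DISTINCT fav_fruit AS original,
                fav_fruit AS standardized
FROM fruit_survey;

-- Step 2: clean up case and spaces, fix a known misspelling, drop plural 's'.
UPDATE recode_demo SET standardized = trim(lower(original));
UPDATE recode_demo SET standardized = 'banana' WHERE standardized LIKE '%nn%';
UPDATE recode_demo SET standardized = rtrim(standardized, 's');

-- Step 3: count using the standardized values.
-- With the sample rows above: apple 3, banana 3.
SELECT standardized, count(*)
FROM fruit_survey
LEFT JOIN recode_demo ON fav_fruit = original
GROUP BY standardized
ORDER BY standardized;
```

Keeping the mapping in a separate table leaves the original data untouched and makes each cleaning rule easy to inspect or undo.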

Working with dates and timestamps

Date/time types and formats

  • date _ YYYY-MM-DD
  • timestamp _ YYYY-MM-DD HH:MM:SS
  • standards
    • ISO 8601 _ YYYY-MM-DD HH:MM:SS
    • timestamp with time zone _ YYYY-MM-DD HH:MM:SS+HH
  • comparisons _ >, <, =
    • current _ now()
  • subtraction, addition
SELECT now() - '2018-01-01';
-- interval elapsed since 2018-01-01

SELECT '2010-01-01'::date + 1;
-- 2010-01-02

SELECT '2018-12-10'::date + '1 year'::interval;
-- 2019-12-10 00:00:00

SELECT '2018-12-10'::date + '1 year 2 days 3 minutes'::interval;
-- 2019-12-12 00:03:00
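
The result type of date arithmetic depends on the operand types, which is easy to miss. The sketch below stores three results in a made-up table `date_math_demo` (the name is an assumption) so the types can be inspected.

```sql
CREATE TEMP TABLE date_math_demo AS
SELECT '2018-12-10'::date - '2018-01-01'::date                  AS days_between, -- date - date: integer days
       '2018-12-10'::timestamp - '2018-01-01'::timestamp        AS elapsed,      -- timestamp - timestamp: interval
       '2018-12-10'::date + '1 year 2 days 3 minutes'::interval AS shifted;      -- date + interval: timestamp

-- days_between = 343, elapsed = 343 days, shifted = 2019-12-12 00:03:00
SELECT * FROM date_math_demo;
```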

Date/time components and aggregation

  • common date/time fields
    • century
    • decade
    • year, month, day
    • hour, minute, second
    • week
    • dow: day of week
  • extracting fields
date_part('field', timestamp)
EXTRACT(FIELD FROM timestamp)
SELECT date_part('month', now()),
			EXTRACT(MONTH FROM now());
  • extract to summarize by field
-- individual sales
SELECT *
FROM sales
WHERE date >= '2010-01-01'
      AND date < '2017-01-01';
-- by month
SELECT date_part('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
  • truncating dates _ date_trunc('field', timestamp)
SELECT date_trunc('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
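
The two monthly summaries above look alike but group differently. A minimal sketch on a made-up table `sales_by_month_demo` (name and rows are assumptions) makes the difference visible.

```sql
-- Hypothetical sales rows spanning two years (not from the lecture).
CREATE TEMP TABLE sales_by_month_demo (date timestamp, amt numeric);
INSERT INTO sales_by_month_demo VALUES
  ('2016-03-05 10:00:00', 10),
  ('2017-03-20 15:30:00', 20),
  ('2017-04-02 09:15:00', 5);

-- date_part keeps only the month number, so March 2016 and March 2017
-- fall into the same group: month 3 sums to 30, month 4 sums to 5.
SELECT date_part('month', date) AS month, sum(amt)
FROM sales_by_month_demo
GROUP BY month
ORDER BY month;

-- date_trunc keeps the year, so each calendar month is its own group:
-- 2016-03-01, 2017-03-01 and 2017-04-01.
SELECT date_trunc('month', date) AS month, sum(amt)
FROM sales_by_month_demo
GROUP BY month
ORDER BY month;
```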

Aggregating with date/time series

  • generate series _ SELECT generate_series(from, to, interval)
    • from the beginning _ stepping by one month from a month-end date does not stay on month ends
    SELECT generate_series('2018-01-31', '2018-12-31', '1 month'::interval);
    
    • from the end _ for month-end dates, generate the first day of each following month and subtract one day
    SELECT generate_series('2018-02-01', '2019-01-01', '1 month'::interval) - '1 day'::interval;
    
    • aggregation with series
    WITH hour_series AS(
        SELECT generate_series('2018-04-23 09:00:00',
                               '2018-04-23 14:00:00',
                               '1 hour'::interval) AS hours)
    
    SELECT hours, count(date)
    FROM hour_series
    LEFT JOIN sales
    ON hours=date_trunc('hour', date)
    GROUP BY hours
    ORDER BY hours;
    
    • aggregation with bins
    WITH bins AS(
        SELECT generate_series('2018-04-23 09:00:00',
                               '2018-04-23 15:00:00',
                               '3 hour'::interval) AS lower,
               generate_series('2018-04-23 12:00:00',
                               '2018-04-23 18:00:00',
                               '3 hour'::interval) AS upper)
    
    SELECT lower, upper, count(date)
    FROM bins
    LEFT JOIN sales
    ON date >= lower
       AND date < upper
    GROUP BY lower, upper
    ORDER BY lower;
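
The point of joining from the generated series is that hours with no sales still appear, with a count of 0. A sketch under assumed data: the tables `sales_hourly_demo` and `hourly_counts_demo` and their rows are made up for illustration.

```sql
-- Hypothetical sales timestamps; note that nothing was sold between 10:00 and 11:00.
CREATE TEMP TABLE sales_hourly_demo (date timestamp);
INSERT INTO sales_hourly_demo VALUES
  ('2018-04-23 09:13:00'), ('2018-04-23 09:47:00'), ('2018-04-23 11:05:00');

CREATE TEMP TABLE hourly_counts_demo AS
WITH hour_series AS (
    SELECT generate_series('2018-04-23 09:00:00'::timestamp,
                           '2018-04-23 11:00:00'::timestamp,
                           '1 hour'::interval) AS hours)
-- LEFT JOIN keeps every generated hour; count(date) ignores the NULLs
-- produced for hours without a matching sale.
SELECT hours, count(date) AS n_sales
FROM hour_series
LEFT JOIN sales_hourly_demo ON hours = date_trunc('hour', date)
GROUP BY hours;

-- 09:00 has 2 sales, 10:00 has 0, 11:00 has 1
SELECT * FROM hourly_counts_demo ORDER BY hours;
```

Using `count(*)` instead of `count(date)` would report 1 for the empty hour, because the unmatched row from the series is still a row.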

Time between events

  • lead and lag
SELECT date,
       lag(date) OVER (ORDER BY date),
       lead(date) OVER (ORDER BY date)
FROM sales;
  • time between events
SELECT date,
       date - lag(date) OVER (ORDER BY date) AS gap
FROM sales;

-- average time
SELECT avg(gap)
FROM (SELECT date - lag(date) OVER (ORDER BY date) AS gap
      FROM sales) AS gaps;
  • change in a time series
SELECT date, amount, lag(amount) OVER (ORDER BY date),
       amount - lag(amount) OVER (ORDER BY date) AS change
FROM sales;
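
A worked sketch of the lag-based gap and change calculations on three made-up sales. The table names `sales_gap_demo` and `gaps_demo` and the rows are assumptions for illustration.

```sql
-- Hypothetical sales (not from the lecture).
CREATE TEMP TABLE sales_gap_demo (date timestamp, amount numeric);
INSERT INTO sales_gap_demo VALUES
  ('2018-04-23 09:00:00', 10),
  ('2018-04-23 09:30:00', 25),
  ('2018-04-23 11:00:00', 15);

CREATE TEMP TABLE gaps_demo AS
SELECT date,
       date - lag(date) OVER (ORDER BY date)     AS gap,    -- time since the previous sale
       amount - lag(amount) OVER (ORDER BY date) AS change  -- difference from the previous amount
FROM sales_gap_demo;

-- The first row has no previous row, so its gap and change are NULL.
-- The remaining gaps are 00:30:00 and 01:30:00; the changes are 15 and -10.
SELECT * FROM gaps_demo ORDER BY date;

-- avg ignores the NULL, giving an average gap of 01:00:00.
SELECT avg(gap) FROM gaps_demo;
```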

Exploring categorical data and unstructed text

Character data types and common issues

  • 길이에 따른 character type
    • character(n) or char(n) _ 길이 n으로 고정
    • character varying(n) or varchar(n) _ 최대 길이 n
    • text or varchar _ 길이 무제한
  • text data 유형
    • categorical: 값을 가지는 짧은 길이의 string이 반복
    • unstructed text: unique 값의 더 긴 string
      • 분석 _ text에서 기능 추출하거나, 특정 특성 존재 여부를 나타내는 변수 생성
  • categorical variables에 대한 grouping & counting
  • SELECT category, count(*) FROM product GROUP BY category;
  • Order
    • Alphabetical Order _ ‘ ‘ < ‘A’ < ‘B’ < ‘a’ < ‘b’
  • SELECT category, count(*) FROM product GROUP BY category ORDER BY count DESC; # 많은 순서대로 ORDER BY category; # categorical variable 순서대로

Cases and Spaces

  • convert case
SELECT lower('aBc DeFg 7-');
  • case insensitive comparisons
SELECT *
FROM fruit
WHERE lower(fav_fruit) = 'apple';
  • case insensitive searches
SELECT *
FROM fruit
WHERE fav_fruit ILIKE '%apple%';

# LIKE 사용할 경우 문자 그대로 값에만 적용
  • trimming spaces
    • trim(’ abc ‘) (== btrim(’ abc ‘)) = ‘abc’
    • rtrim(’ abc ‘) = ‘ abc’
    • lrim(’ abc ‘) = ‘abc ‘
  • trimming other values
  • SELECT trim('Wow!', '!'); # Wow SELECT trim('WoW!', '!wW'); #o
  • combining functions
  • SELECT trim(lower('Wow!'), '!w'); # o

Splitting and concatenating text

  • substring
SELECT left('abcde', 2),
				right('abcde', 2),
				
SELECT left('abc', 10),
				length(left('abc', 10));
  • SELECT substring(string FROM start FOR length);
    • SELECT substr(string, start, length); 결과 동일
  • delimiters
    • splitting _ SELECT split_part(string, delimiter, part);
    SELECT split_part('a,bc,d', ',', 2);
    # bc
    
  • concatenating text
    SELECT concat('a', NULL, 'cc');
    # acc
    SELECT 'a' || NULL || 'cc';
    # 
    
  • SELECT concat('a', 2, 'cc'); SELECT 'a' || 2 || 'cc'; # a2cc

Strategies for multiple transformations

  • CASE WHEN
SELECT CASE WHEN category LIKE '%: %' THEN split_part(category, ': ', 1)
						WHEN category LIKE '% - %' THEN split_part(category, '- ', 1)
						ELSE split_part(category, ' | ', 1)
				END AS major_category, sum(business)
FROM naics
GROUP BY major_category;
  • recoding table
    1. create temp table with original values
    CREATE TEMP TABLE recode AS
    SELECT DISTINCT fav_fruit AS original, 
    								fav_fruit AS standardized
    FROM fruit;
    
    1. update to create standardized values
    UPDATE recode
    SET standardized=trim(lower(original));
    
    UPDATE recode
    SET standardized='banana'
    WHERE standardized LIKE '%nn%';
    
    UPDATE recode
    SET standardized=trim(standardized, 's'));
    
    1. join original data to standardized data
    # original only
    SELECT fav_fruit, count(*)
    FROM fruit
    GROUP BY fav_fruit;
    
  • # with recoded values SELECT standardized, count(*) FROM fruit LEFT JOIN recode ON fav_fruit=original GROUP BY standardized;

Working with dates and timestamps

Date/time types and formats

  • date _ YYYY-MM-DD
  • timestamp _ YYYY-MM-DD HH:MM:SS
  • standards
    • timestamp with timezone
    • YYYY-MM-DD HH:MM:SS+HH
  • YYYY-MM-DD HH:MM:SS
  • comparisons _ >, <, =
    • current _ now()
  • subtraction, addition
SELECT now() - '2018-01-01'

SELECT '2010-01-01'::date + 1;
# 2010-01-02

SELECT '2018-12-10'::date + '1 year'::interval;
# 2019-12-10 00:00:00

SELECT '2018-12-10'::date + '1 year 2 days 3 minutes'::interval;
# 2019-12-12 00:00:03

Date/time components and aggregation

  • common data/time fields
    • century
    • decade
    • year, month, day
    • hour, minute, second
    • week
    • dow: day of week
  • extracting fields
date_part('field', timestamp)
EXTRACT(FIELD FROM timestamp)
SELECT date_part('month', now()),
			EXTRACT(MONTH FROM now());
  • extract to summarize by field
# individual sales
SELECT *
FROM sales
WHERE date >= '2010-01-01'
			AND date < '2017-01-01';
# by month
SELECT date_part('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
  • truncating dates _ date_trunc(’field’, timestamp)
SELECT date_trunc('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;

Aggregating with date/time series

  • generate series _ SELECT generate_series(from, to, interval)
    • from the beginning
    SELECT generate_series('2018-01-31', '2018-12-31', '1 month'::interval)
    
    • aggregation with series
    WITH hour_series AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 14:00:00',
    														'1 hour'::interval) AS hours)
    
    SELECT hours, count(data)
    FROM hour_series
    LEFT JOIN sales
    ON hours=date_Truc('hour', date)
    GROUP BY hours
    GROUP BY hours
    
    • aggregation with bins
    WITH bins AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 15:00:00',
    														'3 hour'::interval) AS lower,
    						generate_series('2018-04-23 12:00:00',
    														'2018-04-23 18:00:00',
    														'3 hour'::interval) AS upper)
    
    SELECT lower, upper, count(date)
    FROM bins
    LEFT JOIN sales
    ON date >= lower
    	AND date < upper
    GROUP BY lower, upper
    ORDER BY lower
    
  • SELECT generate_series('2018-02-01', '2019-01-01', '1 month'::interval) - '1 day'::interval;

Time between events

  • lead and log
SELECT date,
				lag(date) OVER (ORDER BY date),
				lead(date) OVER (ORDER BY date)
FROM sales;
  • time between events
SELECT date,
			date - lag(data) OVER (ORDER BY date) AS gap
FROM sales 

# average time
SELECT avg(gap)
FROM (SELECT date - lag(date) OVER (ORDER BY date) AS gap
	FROM sales) AS gaps;
  • change in a time series
SELECT date, amount, lag(amount) OVER (ORDER BY date),
							amount - lag(amount) OVER (ORDER BY date) AS chane

Exploring categorical data and unstructed text

Character data types and common issues

  • 길이에 따른 character type
    • character(n) or char(n) _ 길이 n으로 고정
    • character varying(n) or varchar(n) _ 최대 길이 n
    • text or varchar _ 길이 무제한
  • text data 유형
    • categorical: 값을 가지는 짧은 길이의 string이 반복
    • unstructed text: unique 값의 더 긴 string
      • 분석 _ text에서 기능 추출하거나, 특정 특성 존재 여부를 나타내는 변수 생성
  • categorical variables에 대한 grouping & counting
  • SELECT category, count(*) FROM product GROUP BY category;
  • Order
    • Alphabetical Order _ ‘ ‘ < ‘A’ < ‘B’ < ‘a’ < ‘b’
  • SELECT category, count(*) FROM product GROUP BY category ORDER BY count DESC; # 많은 순서대로 ORDER BY category; # categorical variable 순서대로

Cases and Spaces

  • convert case
SELECT lower('aBc DeFg 7-');
  • case insensitive comparisons
SELECT *
FROM fruit
WHERE lower(fav_fruit) = 'apple';
  • case insensitive searches
SELECT *
FROM fruit
WHERE fav_fruit ILIKE '%apple%';

# LIKE 사용할 경우 문자 그대로 값에만 적용
  • trimming spaces
    • trim(’ abc ‘) (== btrim(’ abc ‘)) = ‘abc’
    • rtrim(’ abc ‘) = ‘ abc’
    • lrim(’ abc ‘) = ‘abc ‘
  • trimming other values
  • SELECT trim('Wow!', '!'); # Wow SELECT trim('WoW!', '!wW'); #o
  • combining functions
  • SELECT trim(lower('Wow!'), '!w'); # o

Splitting and concatenating text

  • substring
SELECT left('abcde', 2),
				right('abcde', 2),
				
SELECT left('abc', 10),
				length(left('abc', 10));
  • SELECT substring(string FROM start FOR length);
    • SELECT substr(string, start, length); 결과 동일
  • delimiters
    • splitting _ SELECT split_part(string, delimiter, part);
    SELECT split_part('a,bc,d', ',', 2);
    # bc
    
  • concatenating text
    SELECT concat('a', NULL, 'cc');
    # acc
    SELECT 'a' || NULL || 'cc';
    # 
    
  • SELECT concat('a', 2, 'cc'); SELECT 'a' || 2 || 'cc'; # a2cc

Strategies for multiple transformations

  • CASE WHEN
SELECT CASE WHEN category LIKE '%: %' THEN split_part(category, ': ', 1)
						WHEN category LIKE '% - %' THEN split_part(category, '- ', 1)
						ELSE split_part(category, ' | ', 1)
				END AS major_category, sum(business)
FROM naics
GROUP BY major_category;
  • recoding table
    1. create temp table with original values
    CREATE TEMP TABLE recode AS
    SELECT DISTINCT fav_fruit AS original, 
    								fav_fruit AS standardized
    FROM fruit;
    
    1. update to create standardized values
    UPDATE recode
    SET standardized=trim(lower(original));
    
    UPDATE recode
    SET standardized='banana'
    WHERE standardized LIKE '%nn%';
    
    UPDATE recode
    SET standardized=trim(standardized, 's'));
    
    1. join original data to standardized data
    # original only
    SELECT fav_fruit, count(*)
    FROM fruit
    GROUP BY fav_fruit;
    
  • # with recoded values SELECT standardized, count(*) FROM fruit LEFT JOIN recode ON fav_fruit=original GROUP BY standardized;

Working with dates and timestamps

Date/time types and formats

  • date _ YYYY-MM-DD
  • timestamp _ YYYY-MM-DD HH:MM:SS
  • standards
    • timestamp with timezone
    • YYYY-MM-DD HH:MM:SS+HH
  • YYYY-MM-DD HH:MM:SS
  • comparisons _ >, <, =
    • current _ now()
  • subtraction, addition
SELECT now() - '2018-01-01'

SELECT '2010-01-01'::date + 1;
# 2010-01-02

SELECT '2018-12-10'::date + '1 year'::interval;
# 2019-12-10 00:00:00

SELECT '2018-12-10'::date + '1 year 2 days 3 minutes'::interval;
# 2019-12-12 00:00:03

Date/time components and aggregation

  • common data/time fields
    • century
    • decade
    • year, month, day
    • hour, minute, second
    • week
    • dow: day of week
  • extracting fields
date_part('field', timestamp)
EXTRACT(FIELD FROM timestamp)
SELECT date_part('month', now()),
			EXTRACT(MONTH FROM now());
  • extract to summarize by field
# individual sales
SELECT *
FROM sales
WHERE date >= '2010-01-01'
			AND date < '2017-01-01';
# by month
SELECT date_part('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;
  • truncating dates _ date_trunc(’field’, timestamp)
SELECT date_trunc('month', date) AS month, sum(amt)
FROM sales
GROUP BY month
ORDER BY month;

Aggregating with date/time series

  • generate series _ SELECT generate_series(from, to, interval)
    • from the beginning
    SELECT generate_series('2018-01-31', '2018-12-31', '1 month'::interval)
    
    • aggregation with series
    WITH hour_series AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 14:00:00',
    														'1 hour'::interval) AS hours)
    
    SELECT hours, count(date)
    FROM hour_series
    LEFT JOIN sales
    ON hours=date_trunc('hour', date)
    GROUP BY hours
    ORDER BY hours;
    
    • aggregation with bins
    WITH bins AS(
    		SELECT generate_series('2018-04-23 09:00:00',
    														'2018-04-23 15:00:00',
    														'3 hour'::interval) AS lower,
    						generate_series('2018-04-23 12:00:00',
    														'2018-04-23 18:00:00',
    														'3 hour'::interval) AS upper)
    
    SELECT lower, upper, count(date)
    FROM bins
    LEFT JOIN sales
    ON date >= lower
    	AND date < upper
    GROUP BY lower, upper
    ORDER BY lower
    
    • month ends _ generate from the first of each month, then subtract 1 day
    SELECT generate_series('2018-02-01', '2019-01-01', '1 month'::interval) - '1 day'::interval;
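
The point of joining to a generated series is that empty periods still appear, with a count of 0. A self-contained sketch, assuming made-up rows in a hypothetical hourly_sales_demo CTE, followed by a short check of the month-end pattern:

```sql
WITH hourly_sales_demo(date) AS (
    -- Hypothetical sample rows; no sales between 10:00 and 11:00 or after 12:00.
    VALUES ('2018-04-23 09:13:00'::timestamp),
           ('2018-04-23 09:47:00'::timestamp),
           ('2018-04-23 11:05:00'::timestamp)
),
hour_series AS (
    SELECT generate_series('2018-04-23 09:00:00'::timestamp,
                           '2018-04-23 12:00:00'::timestamp,
                           '1 hour'::interval) AS hours
)
SELECT hours, count(date)   -- count(date) ignores the NULLs of unmatched hours
FROM hour_series
LEFT JOIN hourly_sales_demo
       ON hours = date_trunc('hour', date)
GROUP BY hours
ORDER BY hours;
-- 2018-04-23 09:00:00 | 2
-- 2018-04-23 10:00:00 | 0
-- 2018-04-23 11:00:00 | 1
-- 2018-04-23 12:00:00 | 0

-- Month ends: step over month starts, then subtract one day.
SELECT generate_series('2018-02-01'::timestamp,
                       '2018-05-01'::timestamp,
                       '1 month'::interval) - '1 day'::interval AS month_end;
-- 2018-01-31, 2018-02-28, 2018-03-31, 2018-04-30 (each at 00:00:00)
```

Using count(*) instead of count(date) would report 1 for the empty hours, because the LEFT JOIN still produces one row for each of them.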

Time between events

  • lead and lag
SELECT date,
				lag(date) OVER (ORDER BY date),
				lead(date) OVER (ORDER BY date)
FROM sales;
  • time between events
SELECT date,
       date - lag(date) OVER (ORDER BY date) AS gap
FROM sales;

-- average time
SELECT avg(gap)
FROM (SELECT date - lag(date) OVER (ORDER BY date) AS gap
	FROM sales) AS gaps;
  • change in a time series
SELECT date, amount, lag(amount) OVER (ORDER BY date),
       amount - lag(amount) OVER (ORDER BY date) AS change
FROM sales;
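
A small worked example shows what lag returns on the first row and how the gaps come out. The sales_gap_demo CTE and its values are hypothetical sample data:

```sql
WITH sales_gap_demo(date, amount) AS (
    VALUES ('2018-04-23 09:00:00'::timestamp, 10),
           ('2018-04-23 09:30:00'::timestamp, 25),
           ('2018-04-23 11:00:00'::timestamp, 15)
)
SELECT date,
       date - lag(date) OVER (ORDER BY date)     AS gap,    -- NULL on the first row
       amount - lag(amount) OVER (ORDER BY date) AS change  -- NULL on the first row
FROM sales_gap_demo
ORDER BY date;
-- 2018-04-23 09:00:00 | NULL     | NULL
-- 2018-04-23 09:30:00 | 00:30:00 |   15
-- 2018-04-23 11:00:00 | 01:30:00 |  -10
```

Because avg skips NULLs, averaging these gaps gives 01:00:00, the mean of the two real gaps.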
