Process and analyze
3.1. Processing large-scale data
source systems (Google Cloud Storage, Bigtable, Google Cloud SQL) 읽어온 큰 데이터를 처리하고 정규화하고 집계함. 데이터 양이 커서 클러스터로 분산 처리하거나 소프트웨어 툴들의 도움을 받음
Dataproc
cluster에서 실행 (auto scaling)
기존 Hadoop, Hive, Spark 에플리캐이션과 연동
use case: Log processing, Reporting, On-demand Spark clusters, Machine learning
Dataflow
Serverless(cluster X), parallel 처리 (No-Ops)
Spark나 Hadoop과 연동이 아닌 새로운 데이터 pipeline 처리
use case: MapReduce 안쓰는 parallel processing, User analytics, Data science, ETL, Log processing
Dataprep
UI-Driven Data Preparation (No-Ops, 필요시 scaling 가능) use case: Machine learning, Analytics
connect Dataprep to BigQuery
1) Create Flow 2) Add dataset (import) 3) Select BigQuery (left pane) & Create dataset 4) Import & Add to Flow
inspect, process data
1) Edit Recipe dataset의 sample을 Transformer view에서 확인 가능 (visualization)
execute job to load BigQuery
1) click Run in Transformer page
2) click Edit on Create-CSV
3) select BigQuery then create table
4) name output table
5) click Update
6) click Run
job history에서 monitoring 가능
3.2. Analyzing and querying data
BigQuery
앞서 store 할 때 뿐만 아니라 분석할 때도 사용
use case: User analysis(adtech, clickstream, game telemetry), Device and operational metrics(IoT), BI
Machine learning
처리된 결과를 확대시키거나 data-collection 최적화를 제공하기도 하고 결과 예측도 함
- 음성 인식
- 자연어 처리
- 번역
- 동영상 자동 분석
- AI 플랫폼 (TensorFlow)
Explore and visualize
4.1. Explore and visualize
Datalab
interactive한 web 기반 툴
pandas, numpy, scikit-learn 등의 다양한 toolkit 지원
Data science ecosystem
Datalab 말고도, web 기반 툴인 Apache Zeppelin 지원 R 사용하면 Rstudio Server나 Microsoft ML Server 지원 Scala나 Java 사용하면 Jupyter 지원
4.2. Visualizing business intelligence results
Looker
BI platform
BI Engine
analysis service 관리
Sheets
Spreadsheet visualization
Data Catalog
Data discovery and metadata management
Data Studio
Dashboarding and visualization
over 200 connectors ( Google Analytics, BigQuery, Sheets, and external data sources…)
Practice
1) Intro
- connect bigquery and data studio
- visualize my data in data studio
2) getting set up
- accesse public datasets ( catalog )
- set up my GCP ( my console )
3) connecting data studio and bigquery
(1) Open data studio
(2) Start with a Template (Blank template)
(3) Add a data to report Connect to BigQuery (Allow permission to view BigQuery data)
(4) Add table “san_francisco_311” (in public datasets)
(5) Click Manage added data sources under Resources
(6) Edit table fields of “311_service_requests”
(7) “latitude”,”longitude” field from text to Latitude, Longitude in Geo
4) creating visualizations
- Click “Add a chart” and select “Treemap”
- Place the “Treemap” chart and change the parameter (“category” field)
5) building a dashboard
- Click “Add a chart”, select “Google Maps” and change its parameters (Dimension/Tooltip : “neighborhood”, Metric/Bubble size: “record count”)
- Click “Add a chart” and select “Scorecard”
- Select “Add a filter” in the Data panel and create filter (“Street and Sidewalk cleaning” in “category” field)
6) creating filters
- Click “Filter Control” (set filter dimension to “neighborhood”)
- Click “Arrange” and select “Make page-level”
7) test it and share it