
import json
import pandas as pd
df = pd.read_json(json.dumps(hotellist_sc.collect()))
df

addrSummary	availableRooms	category	discount	grade	hotelIdx	latitude	longitude	name	regionName	reviewCount
0	7	hotel	169900	special1	234	30.1	30.1	메이필드호텔	서울	223

At this point, look through the field names to work out which values are actually meaningful.
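One way to survey the fields without reading the table by eye is to summarize the collected rows directly. The sketch below assumes the rows are plain dicts, as in the output of hotellist_sc.collect(). The helper name summarize_fields and the second sample row are made up for illustration, so treat it as a starting point rather than part of the original workflow.

```python
from collections import defaultdict

def summarize_fields(rows):
    """Return {field: (rows containing it, sorted type names)} for dict rows."""
    counts = defaultdict(int)
    types = defaultdict(set)
    for row in rows:
        for key, value in row.items():
            counts[key] += 1
            types[key].add(type(value).__name__)
    return {key: (counts[key], sorted(types[key])) for key in sorted(counts)}

# Hypothetical rows standing in for hotellist_sc.collect()
sample_rows = [
    {"name": "메이필드호텔", "discount": 169900, "grade": "special1", "reviewCount": 223},
    {"name": "Hotel B", "discount": 320000, "grade": "special1"},
]

summary = summarize_fields(sample_rows)
for field, (present, type_names) in summary.items():
    print(field, present, type_names)
```

A field that appears in only some rows, or that mixes types, is worth checking before filtering on it.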

If needed, the result can also be saved as an Excel file.

df.to_excel('output.xlsx', index=False)

Spark RDD Basic API

Why use the RDD API instead of SQL statements

Filter: finding expensive hotels

expesiveHotel = hotellist_sc.filter(lambda row: row['discount'] > 300000)
# Either fetch every matching row...
expesiveHotel.collect()
# ...or fetch only a fixed number of rows
expesiveHotel.take(5)

==> Sample data
[{'addrSummary': '강남 | 삼성역 도보 1분',
  'availableRooms': 5,
  'displayText': '서울 > 파크 하얏트 서울',
  'distance': 0.0,
  'districtName': '강남구',

...
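To see what the filter predicate selects without a Spark session, the same lambda can be run over an ordinary Python list. This is only a local sketch: the three rows are invented, the builtin filter stands in for the RDD filter call, and itertools.islice stands in for take.

```python
from itertools import islice

# Invented rows; only the 'discount' field matters for the predicate
rows = [
    {"name": "A", "discount": 169900},
    {"name": "B", "discount": 350000},
    {"name": "C", "discount": 420000},
]

# Same predicate as in the RDD example
is_expensive = lambda row: row["discount"] > 300000

# Like an RDD filter, the builtin filter is lazy: nothing is evaluated yet
expensive = filter(is_expensive, rows)

# Analogue of take(1): stop after the first match
first_one = list(islice(expensive, 1))

# Analogue of collect(): materialize every match
all_expensive = list(filter(is_expensive, rows))

print(first_one)
print(all_expensive)
```

On a real RDD, take(n) is usually the safer choice while exploring, because collect() pulls every matching row back to the driver.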
