[pandas] Viewing DataFrame Info, Modifying Data, and groupby

바틀비 2024. 1. 18. 04:11

In [1]:

import numpy as np
import pandas as pd
import seaborn as sns

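Inspecting a DataFrame

info(): shows a summary of the DataFrame
- Comparing the total number of rows with each column's non-null count reveals whether missing values exist.
describe(): shows descriptive statistics
df['col'].value_counts(): counts the occurrences of each value in a column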
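As a minimal sketch of the three calls listed above, assuming a small hypothetical DataFrame (`toy` and its values are made up for illustration and are not the lemonade data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the lemonade dataset), with one missing value
toy = pd.DataFrame({
    "Location": ["Park", "Beach", "Beach", "Park"],
    "Lemon": [97, 134, np.nan, 110],
})

toy.info()                               # row count and non-null count per column
stats = toy.describe()                   # summary statistics for numeric columns
counts = toy["Location"].value_counts()  # occurrences of each distinct value

print(stats)
print(counts)
```

Here `info()` should report 3 non-null values for `Lemon` against 4 rows, which is the same count comparison used to spot missing values.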

In [2]:

lemonade = pd.read_csv('data/Lemonade2016.csv')
lemonade.head(3)

Out[2]:

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price
0	7/1/2016	Park	97	67	70	90.0	0.25
1	7/2/2016	Park	98	67	72	90.0	0.25
2	7/3/2016	Park	110	77	71	104.0	0.25

In [3]:

lemonade.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         31 non-null     object 
 1   Location     32 non-null     object 
 2   Lemon        32 non-null     int64  
 3   Orange       32 non-null     int64  
 4   Temperature  32 non-null     int64  
 5   Leaflets     31 non-null     float64
 6   Price        32 non-null     float64
dtypes: float64(2), int64(3), object(2)
memory usage: 1.9+ KB

RangeIndex: 32 entries: the DataFrame has 32 rows in total.
Date 31 non-null: 31 values are non-null, so 1 value is missing.
Leaflets 31 non-null: 31 values are non-null, so 1 value is missing.
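Reading the gap off `info()` works, but the missing values can also be counted directly. A minimal sketch, assuming a hypothetical three-row frame (`toy`) that mimics one missing Date and one missing Leaflets value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing Date and one missing Leaflets value
toy = pd.DataFrame({
    "Date": ["7/1/2016", None, "7/3/2016"],
    "Leaflets": [90.0, 104.0, np.nan],
    "Price": [0.25, 0.25, 0.25],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

This avoids scanning `head()` output by eye to find where the missing values sit.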

In [4]:

# The missing Date value is visible at index 8
lemonade.head(10)

Out[4]:

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price
0	7/1/2016	Park	97	67	70	90.0	0.25
1	7/2/2016	Park	98	67	72	90.0	0.25
2	7/3/2016	Park	110	77	71	104.0	0.25
3	7/4/2016	Beach	134	99	76	98.0	0.25
4	7/5/2016	Beach	159	118	78	135.0	0.25
5	7/6/2016	Beach	103	69	82	90.0	0.25
6	7/6/2016	Beach	103	69	82	90.0	0.25
7	7/7/2016	Beach	143	101	81	135.0	0.25
8	NaN	Beach	123	86	82	113.0	0.25
9	7/9/2016	Beach	134	95	80	126.0	0.25

In [5]:

lemonade.describe()

Out[5]:

	Lemon	Orange	Temperature	Leaflets	Price
count	32.000000	32.000000	32.000000	31.000000	32.000000
mean	116.156250	80.000000	78.968750	108.548387	0.354688
std	25.823357	21.863211	4.067847	20.117718	0.113137
min	71.000000	42.000000	70.000000	68.000000	0.250000
25%	98.000000	66.750000	77.000000	90.000000	0.250000
50%	113.500000	76.500000	80.500000	108.000000	0.350000
75%	131.750000	95.000000	82.000000	124.000000	0.500000
max	176.000000	129.000000	84.000000	158.000000	0.500000

In [6]:

lemonade['Location'].value_counts()

Out[6]:

Location
Beach    17
Park     15
Name: count, dtype: int64

17 of the Location values are Beach.
15 of the Location values are Park.
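`value_counts()` can also report shares instead of raw counts through its `normalize` parameter. A small sketch on a hypothetical Series (the values are made up, not the lemonade data):

```python
import pandas as pd

# Hypothetical location labels for illustration
locations = pd.Series(["Beach", "Beach", "Beach", "Park"], name="Location")

counts = locations.value_counts()                # raw counts per value
shares = locations.value_counts(normalize=True)  # fraction of rows per value

print(counts)
print(shares)
```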

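Modifying DataFrame Data

drop(): removes a column or a row
- drop_duplicates(): removes duplicate rows
df['A'] = ...: adds a new column or overwrites an existing one
sort_values(): sorts the data
Group by: groups rows by a key and computes aggregates such as sum, mean, max, and min for each group.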

In [7]:

lemonade.shape

Out[7]:

(32, 7)

In [8]:

# Remove duplicate rows

lemonade = lemonade.drop_duplicates(keep='first', ignore_index=True)
# keep='first' -> keep the first occurrence of each duplicated row
# ignore_index=True -> renumber the index from 0
lemonade.shape  # shape changes from (32, 7) to (31, 7)

Out[8]:

(31, 7)
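Before dropping, it can help to see which rows count as duplicates. A minimal sketch using `duplicated()` on a hypothetical four-row frame (`toy`), where the third row repeats the second:

```python
import pandas as pd

# Hypothetical toy data: row 2 is an exact copy of row 1
toy = pd.DataFrame({
    "Date": ["7/5/2016", "7/6/2016", "7/6/2016", "7/7/2016"],
    "Lemon": [159, 103, 103, 143],
})

# True for every row that repeats an earlier row
dup_mask = toy.duplicated(keep="first")
print(toy[dup_mask])  # the rows that drop_duplicates would remove

# Drop the repeats and renumber the index from 0
deduped = toy.drop_duplicates(keep="first", ignore_index=True)
print(deduped.shape)
```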

In [9]:

# Add a new column
# Arithmetic: addition
lemonade['Sold'] = lemonade['Lemon'] + lemonade['Orange']
lemonade.head()

Out[9]:

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price	Sold
0	7/1/2016	Park	97	67	70	90.0	0.25	164
1	7/2/2016	Park	98	67	72	90.0	0.25	165
2	7/3/2016	Park	110	77	71	104.0	0.25	187
3	7/4/2016	Beach	134	99	76	98.0	0.25	233
4	7/5/2016	Beach	159	118	78	135.0	0.25	277

In [10]:

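# Arithmetic: the original note here says "multiplication", but this cell adds Price to Sold,
# and the outputs below reflect that addition. Revenue would be Sold * Price.
lemonade['Sales'] = lemonade['Sold'] + lemonade['Price']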
lemonade.head()

Out[10]:

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price	Sold	Sales
0	7/1/2016	Park	97	67	70	90.0	0.25	164	164.25
1	7/2/2016	Park	98	67	72	90.0	0.25	165	165.25
2	7/3/2016	Park	110	77	71	104.0	0.25	187	187.25
3	7/4/2016	Beach	134	99	76	98.0	0.25	233	233.25
4	7/5/2016	Beach	159	118	78	135.0	0.25	277	277.25
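The cell above adds the two columns. If revenue (units sold times unit price) is what is wanted, element-wise multiplication works the same way. A minimal sketch on hypothetical values; the `Revenue` column name is an assumption for illustration:

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({"Sold": [164, 233], "Price": [0.25, 0.5]})

# Element-wise multiplication: each row's Sold times that row's Price
toy["Revenue"] = toy["Sold"] * toy["Price"]
print(toy)
```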

In [11]:

# Modify an existing column
lemonade['Location'] = lemonade["Location"].replace({"Park":"P", "Beach":"B"})
lemonade.head()

Out[11]:

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price	Sold	Sales
0	7/1/2016	P	97	67	70	90.0	0.25	164	164.25
1	7/2/2016	P	98	67	72	90.0	0.25	165	165.25
2	7/3/2016	P	110	77	71	104.0	0.25	187	187.25
3	7/4/2016	B	134	99	76	98.0	0.25	233	233.25
4	7/5/2016	B	159	118	78	135.0	0.25	277	277.25
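One detail of `replace` with a dict: values that are not keys in the dict are left unchanged, whereas `Series.map` with the same dict turns them into NaN. A small sketch on a hypothetical Series; the `"Pier"` value is made up to show the difference:

```python
import pandas as pd

# Hypothetical labels; "Pier" has no entry in the mapping
loc = pd.Series(["Park", "Beach", "Pier"])
mapping = {"Park": "P", "Beach": "B"}

replaced = loc.replace(mapping)  # unmapped values stay as they are
mapped = loc.map(mapping)        # unmapped values become NaN

print(replaced)
print(mapped)
```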

In [12]:

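# Sort the data
lemonade = lemonade.sort_values(by=['Sales'], ascending=False).head()
# sort_values does not modify the original in place, so the sorted result must be assigned to a variable.
# Caution: .head() is applied before the assignment, so lemonade now holds only the top 5 rows,
# and every later cell works on those 5 rows.

# by=['Sales']: sort by the Sales column
# ascending=False: sort in descending order

lemonade.head()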

Out[12]:

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price	Sold	Sales
25	7/26/2016	P	176	129	83	158.0	0.35	305	305.35
10	7/11/2016	B	162	120	83	135.0	0.25	282	282.25
4	7/5/2016	B	159	118	78	135.0	0.25	277	277.25
24	7/25/2016	P	156	113	84	135.0	0.50	269	269.50
6	7/7/2016	B	143	101	81	135.0	0.25	244	244.25
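Because `.head()` is chained before the assignment in the cell above, `lemonade` is reduced to its top five rows. If the full data should be kept, one option is to store the sorted frame and take the top rows separately. A minimal sketch on a hypothetical frame (`toy`):

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "Date": ["7/1", "7/2", "7/3", "7/4"],
    "Sales": [164.25, 305.35, 187.25, 233.25],
})

# Keep every row, sorted in descending order of Sales
sorted_all = toy.sort_values(by="Sales", ascending=False)

# Take the top 2 rows without touching the full frame
top2 = sorted_all.head(2)

# nlargest selects the same rows in one call
top2_alt = toy.nlargest(2, "Sales")

print(top2)
```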

In [13]:

# Remove a column
lemonade_2 = lemonade.drop('Temperature', axis=1)
# drop does not modify the original in place, so the result must be assigned to a new DataFrame.

lemonade_2.head()

Out[13]:

	Date	Location	Lemon	Orange	Leaflets	Price	Sold	Sales
25	7/26/2016	P	176	129	158.0	0.35	305	305.35
10	7/11/2016	B	162	120	135.0	0.25	282	282.25
4	7/5/2016	B	159	118	135.0	0.25	277	277.25
24	7/25/2016	P	156	113	135.0	0.50	269	269.50
6	7/7/2016	B	143	101	135.0	0.25	244	244.25
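`drop` also accepts a `columns=` keyword, which avoids having to remember `axis=1` and takes a list when several columns should go at once. A small sketch on a hypothetical one-row frame (`toy`):

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "Lemon": [97],
    "Orange": [67],
    "Temperature": [70],
    "Price": [0.25],
})

# Remove two columns at once; toy itself is left unchanged
slim = toy.drop(columns=["Temperature", "Price"])
print(slim.columns.tolist())
```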

In [14]:

# Remove a row by index label
# First re-sort and renumber the index
lemonade = lemonade.sort_values(by=['Sales'], ascending=False, ignore_index=True)

lemonade_3 = lemonade.drop(2, axis=0)
lemonade_3.head()  # the whole row with index 2 is removed

Out[14]:

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price	Sold	Sales
0	7/26/2016	P	176	129	83	158.0	0.35	305	305.35
1	7/11/2016	B	162	120	83	135.0	0.25	282	282.25
3	7/25/2016	P	156	113	84	135.0	0.50	269	269.50
4	7/7/2016	B	143	101	81	135.0	0.25	244	244.25

In [17]:

lemonade.head()

Out[17]:

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price	Sold	Sales
0	7/26/2016	P	176	129	83	158.0	0.35	305	305.35
1	7/11/2016	B	162	120	83	135.0	0.25	282	282.25
2	7/5/2016	B	159	118	78	135.0	0.25	277	277.25
3	7/25/2016	P	156	113	84	135.0	0.50	269	269.50
4	7/7/2016	B	143	101	81	135.0	0.25	244	244.25
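Dropping a row leaves a gap in the index labels (0, 1, 3, 4 in Out[14]), and the original frame is untouched, as Out[17] shows. If consecutive labels are wanted afterwards, `reset_index(drop=True)` is one option. A minimal sketch on a hypothetical frame (`toy`):

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({"Sales": [305.35, 282.25, 277.25, 269.50]})

# Remove the row labeled 2; the remaining labels keep their old values
dropped = toy.drop(2, axis=0)

# Renumber from 0 and discard the old labels
renumbered = dropped.reset_index(drop=True)

print(dropped.index.tolist())
print(renumbered.index.tolist())
```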

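Group by
- Groups rows that share a key value and computes the sum, mean, max, min, and so on for each group
- Grouping key column: strings with repeating values (categories)
- Target column: numeric data
- Aggregation functions: sum, mean, standard deviation, max, min, etc.
Why grouping is needed
- To make comparisons meaningful
- Example: comparing income levels across regions of South Korea
- Comparing thousands of administrative dong or hundreds of si/gun/gu is not meaningful.
- Reducing them to the 17 metropolitan cities and provinces makes the comparison meaningful.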
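The regional comparison described above can be sketched as follows. The column names (`province`, `district`, `income`) and all figures are hypothetical and only illustrate collapsing many small units into a few groups:

```python
import pandas as pd

# Hypothetical figures for illustration only; not real income data
income = pd.DataFrame({
    "province": ["A", "A", "B", "B", "B"],       # grouping key: repeating strings
    "district": ["a1", "a2", "b1", "b2", "b3"],  # fine-grained units
    "income": [30, 50, 20, 40, 90],              # target: numeric data
})

# Five districts collapse into two provinces, one row of aggregates each
by_province = income.groupby("province")["income"].agg(["mean", "max", "min"])
print(by_province)
```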

In [18]:

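lemonade.groupby('Location')['Sales'].agg(['sum'])
# Group by Location,
# take the Sales values,
# and aggregate them with sum.
# (lemonade holds only the 5 rows kept in In [12], so the sums cover those 5 rows.)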

Out[18]:

	sum
Location
B	803.75
P	574.85

In [19]:

lemonade.groupby('Location')['Sales'].agg(['sum', 'mean', 'std'])

Out[19]:

	sum	mean	std
Location
B	803.75	267.916667	20.647841
P	574.85	287.425000	25.349778
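When different columns need different aggregates, named aggregation is one way to do it in a single call. A minimal sketch on a hypothetical frame (`toy`); the output column names `total_sold` and `mean_price` are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "Location": ["B", "B", "P", "P"],
    "Sold": [282, 277, 305, 269],
    "Price": [0.25, 0.25, 0.35, 0.50],
})

# Each keyword names an output column: (source column, aggregation function)
summary = toy.groupby("Location").agg(
    total_sold=("Sold", "sum"),
    mean_price=("Price", "mean"),
)
print(summary)
```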

License: Attribution, NonCommercial, NoDerivatives
