Pandas - 2: Data Selection and Filtering

Happy PinGu 2022. 1. 18. 23:16

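The DataFrame [] operator

In NumPy, the [] operator can fetch data by row position, column position, slice range, and so on.
The [] placed directly after a DataFrame is more restricted: it accepts only a column name (or a list of column names), or an expression that can be converted to an index, such as a boolean condition or a slice.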

import pandas as pd

titanic_df = pd.read_csv("titanic_train.csv")
print("Single column:\n{}".format(titanic_df["Pclass"].head(3)))
print("")
print("Multiple columns:\n{}".format(titanic_df[["Survived", "Pclass"]].head(3)))
print("")
# An integer inside [] is looked up as a column name, so it raises KeyError here.
try:
    titanic_df[0]
except KeyError as err:
    print("An integer inside [] raises KeyError: {}".format(err))

# A boolean condition inside [] filters rows.
titanic_df[ titanic_df["Pclass"] == 3 ].head(3)
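The cases above can be reproduced without the CSV file. The following is a minimal sketch on a small made-up DataFrame (toy_df and its values are assumptions for illustration, not the Titanic data) showing each kind of argument that [] accepts and the KeyError for a bare integer.

```python
import pandas as pd

# Hypothetical toy data standing in for titanic_train.csv.
toy_df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Survived": [0, 1, 1, 0],
})

single = toy_df["Pclass"]                     # column name -> Series
multiple = toy_df[["Survived", "Pclass"]]     # list of column names -> DataFrame
first_two = toy_df[0:2]                       # slice -> rows selected by position
third_class = toy_df[toy_df["Pclass"] == 3]   # boolean condition -> filtered rows

# A bare integer is treated as a column label, which does not exist here.
try:
    toy_df[0]
    raised = False
except KeyError:
    raised = True

print(type(single).__name__, type(multiple).__name__)
print(len(first_two), len(third_class), raised)
```

Note that a slice inside [] selects rows while a name selects columns, which is one reason iloc[] and loc[] are usually clearer for row selection.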

DataFrame iloc[ ] operator: position-based indexing

data = {"Name" : ["Dowon", "Junho", "Bomi", "Youngsu"],
       "Year" : [2011, 2016, 2015, 2015],
       "Gender" : ["Male", "Male", "Female", "Male"]}
data_df = pd.DataFrame(data, index = ["one", "two", "three", "four"])
data_df

data_df.iloc[0, 0]

'Dowon'

data_df.iloc[0, 1]

2011

DataFrame loc[ ] operator: label-based indexing

data_df

data_df.loc["two", "Year"]

2016

data_df.loc["four", "Gender"]

'Male'

Slicing with iloc[ ] and loc[ ]

data_df

print("Slice with position-based iloc[ ]:\n{}".format(data_df.iloc[0:3, 1]))

print("-"*35)

print("Slice with label-based loc[ ]:\n{}".format(data_df.loc["one" : "three", "Name"]))
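One detail worth spelling out is how the two slices treat their end point: an iloc[ ] slice excludes the end position, as in ordinary Python slicing, while a loc[ ] slice includes the end label. A small sketch, rebuilding a reduced version of data_df so that it runs on its own:

```python
import pandas as pd

data_df = pd.DataFrame(
    {"Name": ["Dowon", "Junho", "Bomi", "Youngsu"],
     "Year": [2011, 2016, 2015, 2015]},
    index=["one", "two", "three", "four"],
)

# iloc: positions 0 and 1 only; position 2 ("three") is excluded.
by_position = data_df.iloc[0:2, 0]

# loc: labels "one" through "three"; the end label is included.
by_label = data_df.loc["one":"three", "Name"]

print(list(by_position))  # ['Dowon', 'Junho']
print(list(by_label))     # ['Dowon', 'Junho', 'Bomi']
```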

Boolean indexing

Instead of juggling position-based and label-based indexing, you can put a condition inside [ ] and filter rows directly.

titanic_df = pd.read_csv("titanic_train.csv")
titanic_boolean = titanic_df[ titanic_df["Age"] > 60 ]

print(type(titanic_boolean))
titanic_boolean

<class 'pandas.core.frame.DataFrame'>

titanic_df["Age"] > 60

var1 = titanic_df["Age"] > 60
print(type(var1))

<class 'pandas.core.series.Series'>

titanic_df[titanic_df["Age"] > 60][["Name", "Age"]].head(3)

titanic_df[["Name", "Age"]][titanic_df["Age"] > 60].head(3)

titanic_df.loc[titanic_df["Age"] > 60, ["Name", "Age"]].head(3)

titanic_df[ (titanic_df["Age"] > 60) & (titanic_df["Pclass"] == 1) & (titanic_df["Sex"] == "female") ]

Conditions combined with logical operators also work with boolean indexing.

cond1 = titanic_df["Age"] > 60
cond2 = titanic_df["Pclass"] == 1
cond3 = titanic_df["Sex"] == "female"

titanic_df[cond1 & cond2 & cond3]

Conditions can also be assigned to variables. Doing so for complex conditions improves readability.
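The examples above combine conditions with & only. The sketch below, on made-up data (toy_df is an assumption for illustration), also shows | for OR and ~ for NOT. Each comparison needs its own parentheses when written inline, because & and | bind more tightly than > or ==, and the Python keywords and/or cannot be used on a Series.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic set.
toy_df = pd.DataFrame({
    "Age": [70, 25, 65, 40],
    "Pclass": [1, 3, 2, 1],
    "Sex": ["female", "male", "male", "female"],
})

older = toy_df["Age"] > 60
first_class = toy_df["Pclass"] == 1

both = toy_df[older & first_class]     # AND: both conditions hold
either = toy_df[older | first_class]   # OR: at least one condition holds
not_older = toy_df[~older]             # NOT: invert the condition

print(len(both), len(either), len(not_older))  # 1 3 2
```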

Aggregation functions

titanic_df.count()

titanic_df[["Fare", "Age"]].mean()

Fare 32.204208
Age 29.699118
dtype: float64

titanic_df[["Age", "Fare"]].sum()

Age 21205.1700
Fare 28693.9493
dtype: float64

titanic_df[["Age", "Fare"]].sum(axis=1)

titanic_df[["Age", "Fare"]].count()

Age 714
Fare 891
dtype: int64
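The count() output above differs between Age and Fare because count() only counts non-NaN values, and mean() and sum() skip NaN by default as well. A minimal sketch on made-up numbers (toy_df is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Age.
toy_df = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [10.0, 20.0, 30.0],
})

counts = toy_df.count()         # non-NaN values per column
means = toy_df.mean()           # NaN is skipped (skipna=True is the default)
row_sums = toy_df.sum(axis=1)   # axis=1 gives one sum per row instead of per column

print(counts["Age"], counts["Fare"])  # 2 3
print(means["Age"], means["Fare"])    # 30.0 20.0
print(list(row_sums))                 # [30.0, 20.0, 70.0]
```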

groupby()

Pass the column to group on as the by argument of groupby(). To group by several columns, pass their names as a list inside [ ].
Calling groupby() on a DataFrame returns a DataFrameGroupBy object.

titanic_groupby = titanic_df.groupby(by = "Pclass")
print(type(titanic_groupby))
print(titanic_groupby)

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000260EAE65610>

titanic_groupby= titanic_df.groupby("Pclass").count()
titanic_groupby

In the output above, Pclass is the index.

print(type(titanic_groupby))
print(titanic_groupby.shape)
print(titanic_groupby.index)

<class 'pandas.core.frame.DataFrame'>
(3, 11)
Int64Index([1, 2, 3], dtype='int64', name='Pclass')

titanic_groupby = titanic_df.groupby(by = "Pclass")[["PassengerId", "Survived"]].count()
titanic_groupby

titanic_df[["Pclass", "PassengerId", "Survived"]].groupby("Pclass").count()

titanic_df.groupby("Pclass")["Pclass"].count()

Pclass
1 216
2 184
3 491
Name: Pclass, dtype: int64

titanic_df["Pclass"].value_counts()

3 491
1 216
2 184
Name: Pclass, dtype: int64

titanic_df.groupby("Pclass")["Age"].agg(["max", "min"])

agg_format = {"Age" : "max", "SibSp" : "sum", "Fare" : "mean"}
titanic_df.groupby("Pclass").agg(agg_format)
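The text mentions grouping by several columns, but every example above groups by Pclass alone. The following sketch, on made-up data (toy_df is an assumption for illustration), groups by two columns. The result is indexed by both keys, and reset_index() turns them back into ordinary columns.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic set.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Sex": ["female", "male", "male", "male", "female"],
    "Age": [38.0, 54.0, 22.0, 35.0, 26.0],
})

# A list of column names groups by every combination of their values.
mean_age = toy_df.groupby(["Pclass", "Sex"])["Age"].mean()
print(mean_age)

# The group keys form a two-level index; reset_index() makes them columns again.
flat = mean_age.reset_index()
print(flat.columns.tolist())
```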

Handling missing data

The DataFrame isna() method returns True or False for every value in every column, indicating whether that value is NaN (True if it is).

isna( )

titanic_df.isna().head(3)

Only the NaN positions are True.

titanic_df.isna().sum()

The number of NaN values in each column.

fillna( )

titanic_df["Cabin"] = titanic_df["Cabin"].fillna("0000")
titanic_df.head(3)

titanic_df["Age"] = titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df["Embarked"].fillna("S")
titanic_df.isna().sum()
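The column-by-column assignments above can also be written as a single fillna() call that takes a dict mapping column names to fill values. A minimal sketch on made-up data (toy_df is an assumption for illustration), checking the NaN counts before and after:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in both columns.
toy_df = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Cabin": ["C85", None, None],
})

before = toy_df.isna().sum()

# fillna() returns a new DataFrame; toy_df itself is left unchanged.
filled = toy_df.fillna({"Age": toy_df["Age"].mean(), "Cabin": "0000"})
after = filled.isna().sum()

print(before["Age"], before["Cabin"])  # 1 2
print(after["Age"], after["Cabin"])    # 0 0
print(list(filled["Age"]))             # [20.0, 30.0, 40.0]
```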

Transforming data with apply and lambda expressions

Python lambda basics

def get_square(a) :
    return a**2

print("3 squared is {}".format(get_square(3)))

3 squared is 9

lambda_square = lambda x : x**2
print("3 squared is {}".format(lambda_square(3)))

3 squared is 9

a = [1, 2, 3]
squares = map(lambda x : x**2, a)
list(squares)

[1, 4, 9]

Using apply with lambda expressions in pandas

titanic_df["Name_len"] = titanic_df["Name"].apply(lambda x : len(x))
titanic_df[["Name", "Name_len"]].head(3)

titanic_df["Child_Adult"] = titanic_df["Age"].apply(lambda x : "Child" if x <= 15 else "Adult")
titanic_df[["Age", "Child_Adult"]].head(10)

titanic_df["Age_cat"] = titanic_df["Age"].apply(lambda x : "Child" if x <= 15 else("Adult" if x <= 60 else "Elderly"))

titanic_df["Age_cat"].value_counts()

Adult 786
Child 83
Elderly 22
Name: Age_cat, dtype: int64

def get_category(age):
    # Map an age to an age-group label; thresholds are checked in ascending order.
    cat = ""
    if age <= 5 : cat = "Baby"
    elif age <= 12 : cat = "Child"
    elif age <= 18 : cat = "Teen"
    elif age <= 25 : cat = "Student"
    elif age <= 35 : cat = "Young Adult"
    elif age <= 60 : cat = "Adult"
    else : cat = "Elderly"

    return cat

titanic_df["Age_cat"] = titanic_df["Age"].apply(lambda x : get_category(x))
titanic_df[["Age", "Age_cat"]].head()
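One thing to watch with threshold functions like these: every comparison against NaN is False, so a missing age falls through to the final else branch and is labeled "Elderly". In this post Age was already filled with the mean before apply was called, so the issue does not arise here. The sketch below shows the behavior on made-up ages, along with one possible guard. The names age_cat and age_cat_safe and the "Unknown" label are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
ages = pd.Series([4.0, 30.0, 70.0, np.nan])

def age_cat(age):
    # Same three-way split as the lambda above.
    if age <= 15:
        return "Child"
    elif age <= 60:
        return "Adult"
    return "Elderly"

def age_cat_safe(age):
    # Handle missing values explicitly before comparing.
    if pd.isna(age):
        return "Unknown"
    return age_cat(age)

# A named function can be passed to apply directly; no lambda wrapper is needed.
labels = ages.apply(age_cat)
safe_labels = ages.apply(age_cat_safe)

print(list(labels))       # ['Child', 'Adult', 'Elderly', 'Elderly']
print(list(safe_labels))  # ['Child', 'Adult', 'Elderly', 'Unknown']
```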
