IT Share you

Splitting a DataFrame into multiple DataFrames

shareyou 2020. 11. 16. 23:11

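I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents). I would like to split the dataframe into 60 dataframes (one dataframe per participant).

The dataframe (called data) has a variable called 'name', which is the unique code for each participant.

I have tried the following, but nothing happens (or at least the script does not stop within an hour). What I intend to do is split the dataframe (data) into smaller dataframes and append them to a list (datalist):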

import pandas as pd

def splitframe(data, name='name'):

    n = data[name][0]

    df = pd.DataFrame(columns=data.columns)

    datalist = []

    for i in range(len(data)):
        if data[name][i] == n:
            df = df.append(data.iloc[i])
        else:
            datalist.append(df)
            df = pd.DataFrame(columns=data.columns)
            n = data[name][i]
            df = df.append(data.iloc[i])

    return datalist

I do not get an error message; the script just seems to run forever!

Is there a smart way to do it?

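Firstly, your approach is inefficient because appending row by row is slow: every df.append call copies the frame built so far, and the list itself has to grow periodically when it runs out of space for new entries. A list comprehension is better in this respect, because the result is built in a single pass.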
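As a rough illustration of avoiding the per-row append, here is a minimal sketch that keeps the behaviour of the question's loop (a new frame starts wherever 'name' changes from one row to the next) but lets pandas do the work in one pass. The sample data and the run_id helper are assumptions made for the example, not part of the original question.

```python
import pandas as pd

def splitframe(data, name="name"):
    """Split data into one DataFrame per consecutive run of equal values in `name`."""
    # A new run starts wherever the value differs from the previous row.
    run_id = (data[name] != data[name].shift()).cumsum()
    # One groupby pass instead of one append per row.
    return [chunk for _, chunk in data.groupby(run_id)]

# Hypothetical sample data standing in for the real experiment.
data = pd.DataFrame({"name": ["a", "a", "b", "b", "b", "c"],
                     "value": range(6)})
parts = splitframe(data)
print([len(p) for p in parts])
```

If the rows of each participant are not contiguous, grouping directly on the 'name' column collects them regardless of order.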

However, fundamentally I think your approach is a little wasteful: you already have a dataframe, so why create a new one for each of these users?

I would sort the dataframe by the 'name' column, set the index to be this column and, if required, not drop the column.

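Then generate a list of all the unique entries, which you can use to perform a lookup. Crucially, if you are only querying the data, use the selection criteria to pull out one participant's rows on demand instead of building every per-participant frame up front. (Boolean selection like this generally returns a copy rather than a true view, but only of the selected rows.)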

So:

# sort the dataframe
df.sort(columns=['name'], inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False,inplace=True)
# get a list of names
names=df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']
# now you can query all 'joes'

EDIT

sort is now deprecated, so you need to use sort_values instead:

# sort the dataframe
df.sort_values(by='name', inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False,inplace=True)
# get a list of names
names=df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']
# now you can query all 'joes'
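Once 'name' is the index, the lookup can also go through the index itself. The following is only a sketch under assumed sample data (the 'score' column and the names are made up for illustration); it sorts with a stable algorithm so the original row order within each name is preserved.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["joe", "ann", "joe", "ann", "bob"],
                   "score": [1, 2, 3, 4, 5]})

# Sort by name (mergesort is stable) and index by name without dropping the column.
df = df.sort_values(by="name", kind="mergesort").set_index("name", drop=False)

# Unique participant codes, in sorted order.
names = df["name"].unique().tolist()

# Passing a list of labels to .loc always returns a DataFrame,
# even when only one row matches.
joe = df.loc[["joe"]]
print(joe)
```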

Can I ask why not just do it by slicing the data frame? Something like:

import numpy as np
import pandas as pd

# create some data with a Names column
data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] * 4,
                     'Ob1': np.random.rand(16),
                     'Ob2': np.random.rand(16)})

# create a unique list of names
UniqueNames = data.Names.unique()

# create a dictionary with one (initially empty) data frame per name
DataFrameDict = {elem: pd.DataFrame() for elem in UniqueNames}

# fill each entry with the rows belonging to that name
for key in DataFrameDict.keys():
    DataFrameDict[key] = data[data.Names == key]

Hey presto, you have a dictionary of data frames just as (I think) you want them. Need to access one? Just enter

DataFrameDict['Joe']

Hope that helps.
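If the per-name frames are going to be modified afterwards, it may be worth taking explicit copies so that changes stay local to each piece. This is a hedged variant of the dictionary idea above, written as a single comprehension; the seeded random generator and the frames name are assumptions for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the example repeatable
data = pd.DataFrame({"Names": ["Joe", "John", "Jasper", "Jez"] * 4,
                     "Ob1": rng.random(16),
                     "Ob2": rng.random(16)})

# One independent copy per name, keyed by that name.
frames = {name: data[data["Names"] == name].copy()
          for name in data["Names"].unique()}

# Changing a piece does not touch the original frame.
frames["Joe"]["Ob1"] = 0.0
```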

You can convert the groupby object to tuples and then to a dict:

df = pd.DataFrame({'Name':list('aabbef'),
                   'A':[4,5,4,5,5,4],
                   'B':[7,8,9,4,2,3],
                   'C':[1,3,5,7,1,0]}, columns = ['Name','A','B','C'])

print (df)
  Name  A  B  C
0    a  4  7  1
1    a  5  8  3
2    b  4  9  5
3    b  5  4  7
4    e  5  2  1
5    f  4  3  0

d = dict(tuple(df.groupby('Name')))
print (d)
{'b':   Name  A  B  C
2    b  4  9  5
3    b  5  4  7, 'e':   Name  A  B  C
4    e  5  2  1, 'a':   Name  A  B  C
0    a  4  7  1
1    a  5  8  3, 'f':   Name  A  B  C
5    f  4  3  0}

print (d['a'])
  Name  A  B  C
0    a  4  7  1
1    a  5  8  3

It is not recommended, but it is possible to create one DataFrame variable per group:

for i, g in df.groupby('Name'):
    globals()['df_' + str(i)] =  g

print (df_a)
  Name  A  B  C
0    a  4  7  1
1    a  5  8  3

Groupby can help you:

grouped = data.groupby(['name'])

You can then work with each group as if it were a dataframe for each participant. DataFrameGroupBy methods such as apply, transform, aggregate, head, first and last return a DataFrame object.

Or you can make a list from grouped and get every DataFrame by index:

l_grouped = list(grouped)

l_grouped[0][1] is the DataFrame for the first group (the first name).
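To make the "work with each group" idea concrete, here is a small sketch with made-up participant data (the 'rt' column and the participant codes are assumptions). It pulls out one participant with get_group and computes per-participant summaries without materializing separate frames.

```python
import pandas as pd

# Hypothetical experiment data: participant code and a reaction time.
data = pd.DataFrame({"name": ["p1", "p2", "p1", "p3", "p2"],
                     "rt": [0.5, 0.7, 0.6, 0.9, 0.8]})

grouped = data.groupby("name")

# The rows of a single participant, as a DataFrame.
p2 = grouped.get_group("p2")

# Per-participant summaries computed directly on the groups.
sizes = grouped.size()
mean_rt = grouped["rt"].mean()
print(mean_rt)
```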

Easy:

[v for k, v in df.groupby('name')]

In addition to Gusev Slava's answer, you might want to use groupby's groups:

{key: df.loc[value] for key, value in df.groupby("name").groups.items()}

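This yields a dictionary whose keys are the values you grouped by, each pointing to the corresponding partition. The advantage is that the keys are kept and do not vanish into list indices.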
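For readers unfamiliar with the groups attribute, the sketch below shows what it holds: a mapping from each key to the index labels of that group's rows, which df.loc can then turn into a frame. The small frame is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": list("aabbef"),
                   "A": [4, 5, 4, 5, 5, 4]})

# Maps each group key to the index labels of its rows.
groups = df.groupby("name").groups

# Look the labels up in the original frame to get one partition per key.
parts = {key: df.loc[labels] for key, labels in groups.items()}
print(parts["b"])
```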

In [28]: df = DataFrame(np.random.randn(1000000,10))

In [29]: df
Out[29]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0    1000000  non-null values
1    1000000  non-null values
2    1000000  non-null values
3    1000000  non-null values
4    1000000  non-null values
5    1000000  non-null values
6    1000000  non-null values
7    1000000  non-null values
8    1000000  non-null values
9    1000000  non-null values
dtypes: float64(10)

In [30]: frames = [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]

In [31]: %timeit [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
1 loops, best of 3: 849 ms per loop

In [32]: len(frames)
Out[32]: 16667
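The session above uses xrange, so it was recorded under Python 2. A rough Python 3 equivalent of the same fixed-size chunking, assuming a smaller frame and a chunk_size name chosen for the example, could look like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 3))
chunk_size = 60

# Step through the row positions in strides of chunk_size;
# iloc clips the final slice, so the last chunk may be shorter.
frames = [df.iloc[start:start + chunk_size]
          for start in range(0, len(df), chunk_size)]
print(len(frames), len(frames[-1]))
```

Note that this splits by row position into blocks of 60 rows; it does not split by participant.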

Here is a groupby approach (you could do an arbitrary apply rather than a sum):

In [9]: g = df.groupby(lambda x: x/60)

In [8]: g.sum()    

Out[8]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16667 entries, 0 to 16666
Data columns (total 10 columns):
0    16667  non-null values
1    16667  non-null values
2    16667  non-null values
3    16667  non-null values
4    16667  non-null values
5    16667  non-null values
6    16667  non-null values
7    16667  non-null values
8    16667  non-null values
9    16667  non-null values
dtypes: float64(10)

Sum is cythonized, which is why this is so fast:

In [10]: %timeit g.sum()
10 loops, best of 3: 27.5 ms per loop

In [11]: %timeit df.groupby(lambda x: x/60)
1 loops, best of 3: 231 ms per loop
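One caveat when rerunning this today: in Python 3, x/60 is true division, so every row would land in its own group. A minimal sketch of the same idea with floor division, on an assumed small frame with a default integer index:

```python
import numpy as np
import pandas as pd

# 1000 rows, 3 columns, filled with 0..2999 so the sums are easy to check.
df = pd.DataFrame(np.arange(3000).reshape(1000, 3))

# The callable receives each index label; floor division buckets
# labels 0-59 into group 0, 60-119 into group 1, and so on.
g = df.groupby(lambda x: x // 60)
sums = g.sum()
print(sums.shape)
```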

A list comprehension based on groupby: it stores all the split DataFrames in a list variable, and they can be accessed by index.

Example

ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]

ans[0]
ans[0].column_name

If you already have some labels for your data, you can use the groupby command.

 out_list = [group[1] for group in in_series.groupby(label_series.values)]

Here is a detailed example:

Say we want to split a pd.Series into a list of chunks using some labels. For example, in_series is:

2019-07-01 08:00:00   -0.10
2019-07-01 08:02:00    1.16
2019-07-01 08:04:00    0.69
2019-07-01 08:06:00   -0.81
2019-07-01 08:08:00   -0.64
Length: 5, dtype: float64

And its corresponding label_series is:

2019-07-01 08:00:00   1
2019-07-01 08:02:00   1
2019-07-01 08:04:00   2
2019-07-01 08:06:00   2
2019-07-01 08:08:00   2
Length: 5, dtype: float64

Run

out_list = [group[1] for group in in_series.groupby(label_series.values)]

which returns out_list a list of two pd.Series:

[2019-07-01 08:00:00   -0.10
2019-07-01 08:02:00   1.16
Length: 2, dtype: float64,
2019-07-01 08:04:00    0.69
2019-07-01 08:06:00   -0.81
2019-07-01 08:08:00   -0.64
Length: 3, dtype: float64]

Note that you can use some parameters from in_series itself to group the series, e.g., in_series.index.day
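The example can be reproduced end to end with the sketch below. The construction of the index with date_range is an assumption (the original only shows the printed values); the grouping calls are the ones described above.

```python
import pandas as pd

# Rebuild the example series on a 2-minute DatetimeIndex.
idx = pd.date_range("2019-07-01 08:00", periods=5, freq="2min")
in_series = pd.Series([-0.10, 1.16, 0.69, -0.81, -0.64], index=idx)
label_series = pd.Series([1, 1, 2, 2, 2], index=idx)

# Split by the external labels.
out_list = [chunk for _, chunk in in_series.groupby(label_series.values)]

# Split by a property of the series itself, here the day of the month.
by_day = [chunk for _, chunk in in_series.groupby(in_series.index.day)]
print(len(out_list), len(by_day))
```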

I had a similar problem. I had a time series of daily sales for 10 different stores and 50 different items. I needed to split the original dataframe into 500 dataframes (10 stores * 50 items) to apply machine learning models to each of them, and I couldn't do it manually.

This is the head of the dataframe (screenshot omitted here; it includes the 'item' and 'store' columns used below):

I created two lists: one for the names of the dataframes and one for the [item_number, store_number] pairs.

    # items and stores are assumed to hold the unique item and store numbers,
    # e.g. items = df['item'].unique() and stores = df['store'].unique()
    df_names = []
    for i in range(1, len(items) * len(stores) + 1):
        df_names.append('df' + str(i))

    item_store_pairs = []
    for item in items:
        for store in stores:
            item_store_pairs.append([item, store])

And once the two lists are ready you can loop over them to create the dataframes you want:

    for name, (item, store) in zip(df_names, item_store_pairs):
        # boolean indexing keeps only the matching rows, so no dropna is needed
        globals()[name] = df[(df['item'] == item) & (df['store'] == store)]

In this way I have created 500 dataframes.

Hope this will be helpful!
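As a possible alternative to generating 500 global variables, the same split can be kept in a dictionary keyed by (item, store). This is a sketch on made-up sales data, not the author's method; the column names follow the ones used above and the values are invented.

```python
import pandas as pd

# Hypothetical miniature of the sales data.
sales = pd.DataFrame({"store": [1, 1, 2, 2, 1],
                      "item": [10, 20, 10, 20, 10],
                      "sales": [5, 3, 8, 1, 7]})

# Grouping on two columns yields (item, store) tuples as keys.
frames = {key: group for key, group in sales.groupby(["item", "store"])}

# Look up one item/store combination.
print(frames[(10, 1)])
```

A model can then be fitted per entry by looping over frames.items(), with the key telling you which item and store each frame belongs to.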

Reference URL: https://stackoverflow.com/questions/19790790/splitting-dataframe-into-multiple-dataframes
