GitHub Data Extraction and Analysis

Who are the top developers in your field, which project is the hottest in your area, and what is the current hot language?

Posted by T.L on October 2, 2016

GitHub, the world's largest code-hosting platform, sees thousands of new projects every hour and has made an indelible contribution to open source. This article uses NetworkX to analyze GitHub as a graph: GitHub's rich data lets us build models that can be put to use in many different ways. Here we turn GitHub users, code, and repositories into an interest graph. The article covers:

  • the GitHub developer platform and its API
  • how to draw graphs with NetworkX
  • building a GitHub interest graph
  • graph algorithms

Getting to know the GitHub API

As with Twitter and Facebook, the first step is getting access to GitHub's API, documented at https://developer.github.com/v3/.
Most of it we won't need; we only care about users and repositories. All API access is over HTTPS against https://api.github.com, and all data is sent and received as JSON.

Creating an API connection

GitHub implements OAuth, and once you have a GitHub account there are two ways to obtain API access. The first is a personal access token, which you can create for your own use (or you can implement the web flow to let other users authorize your application). The second is an OAuth application: every developer registers their application before starting, and registration assigns a unique client ID and client secret. For simplicity we use a personal access token here: click "Generate new token" to create a token carrying the scopes you need.

Send a request to the API root to try the token out:

curl "https://api.github.com/?access_token=$TOKEN"
{
  "current_user_url": "https://api.github.com/user",
  "current_user_authorizations_html_url": "https://github.com/settings/connections/applications{/client_id}",
  "authorizations_url": "https://api.github.com/authorizations",
  "code_search_url": "https://api.github.com/search/code?q={query}{&page,per_page,sort,order}",
  "emails_url": "https://api.github.com/user/emails",
  "emojis_url": "https://api.github.com/emojis",
  "events_url": "https://api.github.com/events",
  "feeds_url": "https://api.github.com/feeds",
  "followers_url": "https://api.github.com/user/followers",
  "following_url": "https://api.github.com/user/following{/target}",
  "gists_url": "https://api.github.com/gists{/gist_id}",
  "hub_url": "https://api.github.com/hub",
  "issue_search_url": "https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}",
  "issues_url": "https://api.github.com/issues",
  "keys_url": "https://api.github.com/user/keys",
  "notifications_url": "https://api.github.com/notifications",
  "organization_repositories_url": "https://api.github.com/orgs/{org}/repos{?type,page,per_page,sort}",
  "organization_url": "https://api.github.com/orgs/{org}",
  "public_gists_url": "https://api.github.com/gists/public",
  "rate_limit_url": "https://api.github.com/rate_limit",
  "repository_url": "https://api.github.com/repos/{owner}/{repo}",
  "repository_search_url": "https://api.github.com/search/repositories?q={query}{&page,per_page,sort,order}",
  "current_user_repositories_url": "https://api.github.com/user/repos{?type,page,per_page,sort}",
  "starred_url": "https://api.github.com/user/starred{/owner}{/repo}",
  "starred_gists_url": "https://api.github.com/gists/starred",
  "team_url": "https://api.github.com/teams",
  "user_url": "https://api.github.com/users/{user}",
  "user_organizations_url": "https://api.github.com/user/orgs",
  "user_repositories_url": "https://api.github.com/users/{user}/repos{?type,page,per_page,sort}",
  "user_search_url": "https://api.github.com/search/users?q={query}{&page,per_page,sort,order}"
}

The GitHub API follows the HATEOAS design: the response above tells you where to go next. To get information about the current user, for example, access api.github.com/user, which returns the result below.

{
"login": "luzhijun",
"id": 15256911,
"avatar_url": "https://avatars.githubusercontent.com/u/15256911?v=3",
"gravatar_id": "",
"url": "https://api.github.com/users/luzhijun",
"html_url": "https://github.com/luzhijun",
...
}
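
The same call can be made from Python. Below is a minimal sketch (not part of the original post) using the requests library; passing the token in an Authorization header is equivalent to the access_token query parameter used above:

import requests

# Hedged sketch: fetch the authenticated user's profile.
# TOKEN is a placeholder for your own personal access token.
TOKEN = ''
resp = requests.get('https://api.github.com/user',
                    headers={'Authorization': 'token ' + TOKEN})
print(resp.json()['login'])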

PyGithub

Here let me introduce PyGithub, a package that makes GitHub convenient to use from Python: you can manage GitHub from Python scripts, and its API mirrors GitHub's own. As an example, fetch all repositories of a given user:

from github import Github
# Specify your own access token here
ACCESS_TOKEN = ''
USER = 'luzhijun'
client = Github(ACCESS_TOKEN)
user = client.get_user(USER)
REPOS = user.get_repos()
print(list(REPOS))

[Repository(full_name="luzhijun/huxblog-boilerplate"), Repository(full_name="luzhijun/leecode"), Repository(full_name="luzhijun/luzhijun.github.io"), Repository(full_name="luzhijun/Optimization"), Repository(full_name="luzhijun/SVDRecommenderSystem")]
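
One caveat: get_repos() returns a lazily paginated result, so the list() call above forces PyGithub to fetch every page. A sketch of iterating on demand instead, which lets you stop early:

for repo in user.get_repos():
    # Each page is fetched from the API only when the iteration reaches it
    print(repo.full_name)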

How to draw graphs with NetworkX

Getting to know NetworkX

NetworkX is a Python toolkit for graph theory and complex-network modeling. It ships with the standard graph and complex-network analysis algorithms, making it easy to run network analyses, simulations, and the like.
Let's create a directed graph with a single edge X->Y:

import networkx as nx

# Create a directed graph
g = nx.DiGraph()
# Add an edge X->Y
g.add_edge('X', 'Y')
# Print summary information about the graph
print (nx.info(g),'\n')

print ("Nodes:", g.nodes())
print ("Edges:", g.edges())
# Node attributes
print ("X props:", g.node['X'])
print ("Y props:", g.node['Y'])
# Edge attributes
print ("X=>Y props:", g['X']['Y'])
# Update node attributes
g.node['X'].update({'prop1' : 'value1'})
print ("X props:", g.node['X'])
# Update edge attributes
g['X']['Y'].update({'label' : 'label1'})
print ("X=>Y props:", g['X']['Y'])

Name:
Type: DiGraph  # an undirected graph would show Graph
Number of nodes: 2
Number of edges: 1
Average in degree: 0.5000
Average out degree: 0.5000

Nodes: ['Y', 'X']
Edges: [('X', 'Y')]
X props: {}
Y props: {}
X=>Y props: {}
X props: {'prop1': 'value1'}
X=>Y props: {'label': 'label1'}

Both directed and undirected graphs can carry edge weights via add_weighted_edges_from, which accepts one or more triples (u, v, w), where u is the source node, v the target node, and w the weight. For example:

g.add_weighted_edges_from([('X','Y',10.0)])
print (g.get_edge_data('X','Y'))

{'weight': 10.0, 'label': 'label1'}

NetworkX ships with the classic graph-theory algorithms: DFS, BFS, shortest paths, minimum spanning trees, maximum flow, and many more. Even if you are not working on complex networks and only need graph theory, NetworkX serves well as a basic development kit; see the networkx documentation for more, and a small taste follows below.
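
As a small illustration (not from the original post), two of those built-in algorithms on a toy weighted graph:

import networkx as nx

# Toy triangle: the direct edge A-C costs more than the detour via B
t = nx.Graph()
t.add_weighted_edges_from([('A', 'B', 1.0), ('B', 'C', 2.0), ('A', 'C', 5.0)])
# The weighted shortest path prefers A->B->C (cost 3) over A->C (cost 5)
print(nx.shortest_path(t, 'A', 'C', weight='weight'))    # ['A', 'B', 'C']
# The minimum spanning tree keeps the two cheapest edges
print(sorted(nx.minimum_spanning_edges(t, data=False)))  # [('A', 'B'), ('B', 'C')]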

Building an interest graph with NetworkX

An interest graph differs from a social-network graph; the biggest difference is that the nodes of an interest graph are not necessarily the same kind of entity. Suppose I have starred a project; the interest graph then contains:
me --(gazes)--> project
Many other people follow the project too, so let's find everyone who starred it.

from github import Github
import networkx as nx

ACCESS_TOKEN = ''
USER = 'minrk'
REPO = 'findspark'
client = Github(ACCESS_TOKEN)
user = client.get_user(USER)
repo = user.get_repo(REPO)
stargazers = list(repo.get_stargazers())  # the users who starred the repo

g = nx.DiGraph()
g.add_node(repo.name + '(repo)', type='repo', lang=repo.language, owner=user.login)

for sg in stargazers:
    g.add_node(sg.login + '(user)', type='user')
    g.add_edge(sg.login + '(user)', repo.name + '(repo)', type='gazes')
# Print basic graph statistics
print (nx.info(g),'\n')
# Print attributes of the repo node and of one user node
print (g.node['findspark(repo)'])
print (g.node['luzhijun(user)'],'\n')
# Print the attributes of one edge
print (g['luzhijun(user)']['findspark(repo)'])
# Print the adjacency (successors) of a user node and of the repo node
print (g['luzhijun(user)'])
print (g['findspark(repo)'])
# Print the user's in- and out-edges
print (g.in_edges(['luzhijun(user)']))
print (g.out_edges(['luzhijun(user)']))
# Print the repo's in- and out-edges
print (g.in_edges(['findspark(repo)']))
print (g.out_edges(['findspark(repo)']))

Name:
Type: DiGraph
Number of nodes: 81
Number of edges: 80
Average in degree: 0.9877   # 80/81
Average out degree: 0.9877  # 80/81

{'lang': 'Python', 'type': 'repo', 'owner': 'minrk'}
{'type': 'user'}

{'type': 'gazes'}
{'findspark(repo)': {'type': 'gazes'}}
{}
[]
[('luzhijun(user)', 'findspark(repo)')]
[(‘lendenmc(user)’, ‘findspark(repo)’), (‘nehalecky(user)’, ‘findspark(repo)’), (‘charsmith(user)’, ‘findspark(repo)’), (‘cwharland(user)’, ‘findspark(repo)’), (‘d3borah(user)’, ‘findspark(repo)’), (‘cimox(user)’, ‘findspark(repo)’), (‘dirmeier(user)’, ‘findspark(repo)’), (‘paulochf(user)’, ‘findspark(repo)’), (‘qingniufly(user)’, ‘findspark(repo)’), (‘hmourit(user)’, ‘findspark(repo)’), (‘ryan-williams(user)’, ‘findspark(repo)’), (‘rholder(user)’, ‘findspark(repo)’), (‘xysmas(user)’, ‘findspark(repo)’), (‘linkTDP(user)’, ‘findspark(repo)’), (‘minimaxir(user)’, ‘findspark(repo)’), (‘KLXN(user)’, ‘findspark(repo)’), (‘quadnix(user)’, ‘findspark(repo)’), (‘luzhijun(user)’, ‘findspark(repo)’), (‘chrinide(user)’, ‘findspark(repo)’), (‘jaredmichaelsmith(user)’, ‘findspark(repo)’), (‘rgbkrk(user)’, ‘findspark(repo)’), (‘jiamo(user)’, ‘findspark(repo)’), (‘drizham(user)’, ‘findspark(repo)’), (‘assumednormal(user)’, ‘findspark(repo)’), (‘cvincent00(user)’, ‘findspark(repo)’), (‘xiaohan2012(user)’, ‘findspark(repo)’), (‘dapurv5(user)’, ‘findspark(repo)’), (‘pchalasani(user)’, ‘findspark(repo)’), (‘benetka(user)’, ‘findspark(repo)’), (‘rohithreddy(user)’, ‘findspark(repo)’), (‘seanjensengrey(user)’, ‘findspark(repo)’), (‘opikalo(user)’, ‘findspark(repo)’), (‘amontalenti(user)’, ‘findspark(repo)’), (‘Bekterra(user)’, ‘findspark(repo)’), (‘Grillz(user)’, ‘findspark(repo)’), (‘nchammas(user)’, ‘findspark(repo)’), (‘markbarks(user)’, ‘findspark(repo)’), (‘jesusjsc(user)’, ‘findspark(repo)’), (‘d2207197(user)’, ‘findspark(repo)’), (‘Erstwild(user)’, ‘findspark(repo)’), (‘henridf(user)’, ‘findspark(repo)’), (‘ranjankumar-gh(user)’, ‘findspark(repo)’), (‘locojay(user)’, ‘findspark(repo)’), (‘WilliamQLiu(user)’, ‘findspark(repo)’), (‘szinya(user)’, ‘findspark(repo)’), (‘seanjh(user)’, ‘findspark(repo)’), (‘vherasme(user)’, ‘findspark(repo)’), (‘stared(user)’, ‘findspark(repo)’), (‘freeman-lab(user)’, ‘findspark(repo)’), (‘wy36101299(user)’, ‘findspark(repo)’), (‘esafak(user)’, ‘findspark(repo)’), (‘napjon(user)’, ‘findspark(repo)’), (‘richardskim111(user)’, ‘findspark(repo)’), (‘SimonArnu(user)’, ‘findspark(repo)’), (‘lmillefiori(user)’, ‘findspark(repo)’), (‘binhe22(user)’, ‘findspark(repo)’), (‘robcowie(user)’, ‘findspark(repo)’), (‘jazzwang(user)’, ‘findspark(repo)’), (‘Dumbris(user)’, ‘findspark(repo)’), (‘giulioungaretti(user)’, ‘findspark(repo)’), (‘cbouey(user)’, ‘findspark(repo)’), (‘andrewiiird(user)’, ‘findspark(repo)’), (‘DaniGate(user)’, ‘findspark(repo)’), (‘rdhyee(user)’, ‘findspark(repo)’), (‘lgautier(user)’, ‘findspark(repo)’), (‘sandysnunes(user)’, ‘findspark(repo)’), (‘willcline(user)’, ‘findspark(repo)’), (‘aliciatb(user)’, ‘findspark(repo)’), (‘d18s(user)’, ‘findspark(repo)’), (‘he0x(user)’, ‘findspark(repo)’), (‘liulixiang1988(user)’, ‘findspark(repo)’), (‘alexandercbooth(user)’, ‘findspark(repo)’), (‘branning(user)’, ‘findspark(repo)’), (‘bguOIQ(user)’, ‘findspark(repo)’), (‘coder-chenzhi(user)’, ‘findspark(repo)’), (‘mdbishop(user)’, ‘findspark(repo)’), (‘Hguimaraes(user)’, ‘findspark(repo)’), (‘ocanbascil(user)’, ‘findspark(repo)’), (‘fish2000(user)’, ‘findspark(repo)’), (‘stzonis(user)’, ‘findspark(repo)’)]
[]

Some NetworkX centrality measures

  • Degree centrality
    • The most direct measure of node centrality in network analysis. The larger a node's degree, the higher its degree centrality and the more important the node is in the network. For node i, $C_d(i)=\frac{d(i)}{n-1}$, which ranges over [0, 1]. (A small sanity-check sketch follows this list.)
      • degree_centrality(G) Compute the degree centrality for nodes.
      • in_degree_centrality(G) Compute the in-degree centrality for nodes.
      • out_degree_centrality(G) Compute the out-degree centrality for nodes.
  • Betweenness centrality
    • Measures a node's importance by the number of shortest paths that pass through it. If two non-adjacent participants k and j want to interact and participant i sits on the path between them, i may hold some control over their interaction. Betweenness quantifies i's control over other nodes: if i lies on the interaction paths of many pairs of nodes, i is an important participant.
      • betweenness_centrality(G[, normalized, …]) Compute betweenness centrality for nodes.
      • edge_betweenness_centrality(G[, normalized, …]) Compute betweenness centrality for edges.
  • Closeness centrality
    • Reflects how near a node is to all the other nodes in the network, taking proximity or distance as the point of view. The idea is that a participant is central if it can easily interact with all other participants, i.e. its distance to everyone else is short, so shortest distances give us the value. Writing $d(i,j)$ for the shortest distance between participants i and j (measured by the number of links on the shortest path), $C_c(i)=\frac{n-1}{\sum_{j=1}^{n}d(i,j)}$.
      • closeness_centrality(G[, v, weighted_edges]) Compute closeness centrality for nodes.
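
To make the degree-centrality formula concrete, here is a small sanity check (not from the original post) on a four-node path graph:

import networkx as nx

# Path graph 0-1-2-3: end nodes have degree 1, middle nodes degree 2
pg = nx.path_graph(4)
n = pg.number_of_nodes()
# Manual computation of C_d(i) = d(i) / (n - 1)
manual = {v: pg.degree(v) / (n - 1) for v in pg.nodes()}
print(manual)                    # {0: 0.33..., 1: 0.66..., 2: 0.66..., 3: 0.33...}
print(nx.degree_centrality(pg))  # matches the manual values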

For practice, we will study the Krackhardt kite graph, so called because, yes, it is shaped like a kite.

from operator import itemgetter
kkg = nx.generators.small.krackhardt_kite_graph()
print ("Degree Centrality")
print (sorted(nx.degree_centrality(kkg).items(), 
             key=itemgetter(1), reverse=True),'\n')
print ("Betweenness Centrality")
print (sorted(nx.betweenness_centrality(kkg).items(), 
             key=itemgetter(1), reverse=True),'\n')
print ("Closeness Centrality")
print (sorted(nx.closeness_centrality(kkg).items(), 
             key=itemgetter(1), reverse=True))

Degree Centrality
[(3, 0.6666666666666666), (5, 0.5555555555555556), (6, 0.5555555555555556), (0, 0.4444444444444444), (1, 0.4444444444444444), (2, 0.3333333333333333), (4, 0.3333333333333333), (7, 0.3333333333333333), (8, 0.2222222222222222), (9, 0.1111111111111111)]

Betweenness Centrality
[(7, 0.38888888888888884), (5, 0.23148148148148148), (6, 0.23148148148148148), (8, 0.2222222222222222), (3, 0.10185185185185183), (0, 0.023148148148148143), (1, 0.023148148148148143), (2, 0.0), (4, 0.0), (9, 0.0)]

Closeness Centrality
[(5, 0.6428571428571429), (6, 0.6428571428571429), (3, 0.6), (7, 0.6), (0, 0.5294117647058824), (1, 0.5294117647058824), (2, 0.5), (4, 0.5), (8, 0.42857142857142855), (9, 0.3103448275862069)]

Building the GitHub interest graph

Finding the gurus

Gurus generally have many followers and are very active in the community. Besides the stars used above, GitHub also gives us a "follows" relation, much like Weibo or Facebook, so let's extend the graph with follow relationships for every stargazer. For simplicity, followers are only linked to existing nodes; no new nodes are added.
Since the API allows 5,000 requests per hour, we print the remaining quota to check whether it is running out; if it is exhausted, an exception is raised.

import sys

for i, sg in enumerate(stargazers):
    # Add follow edges, but only between users already in the graph
    try:
        for follower in sg.get_followers():
            if follower.login + '(user)' in g:
                g.add_edge(follower.login + '(user)', sg.login + '(user)', 
                           type='follows')
    except Exception:  # e.g. ssl.SSLError
        sys.stderr.write("Encountered an error fetching followers for " +
                         sg.login + ". Skipping.\n")

    print ("Processed", i+1, "stargazers. Num nodes/edges in graph", \
          g.number_of_nodes(), "/", g.number_of_edges())
    print ("Rate limit remaining", client.rate_limiting)

Processed 1 stargazers. Num nodes/edges in graph 81 / 80
Rate limit remaining (4999, 5000)
Processed 2 stargazers. Num nodes/edges in graph 81 / 83
Rate limit remaining (4987, 5000)
Processed 3 stargazers. Num nodes/edges in graph 81 / 84
Rate limit remaining (4985, 5000)
Processed 4 stargazers. Num nodes/edges in graph 81 / 85
Rate limit remaining (4983, 5000)
Processed 5 stargazers. Num nodes/edges in graph 81 / 86
Rate limit remaining (4981, 5000)
Processed 6 stargazers. Num nodes/edges in graph 81 / 90
Rate limit remaining (4968, 5000)

Rate limit remaining (4857, 5000)
Processed 78 stargazers. Num nodes/edges in graph 81 / 95
Rate limit remaining (4856, 5000)
Processed 79 stargazers. Num nodes/edges in graph 81 / 95
Rate limit remaining (4855, 5000)
Processed 80 stargazers. Num nodes/edges in graph 81 / 95
Rate limit remaining (4854, 5000)
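
With many stargazers the hourly quota can genuinely run dry. A hedged sketch (not from the original post) of one way to wait for the quota to reset rather than fail, using PyGithub's rate_limiting and rate_limiting_resettime attributes:

import time

def wait_if_rate_limited(client, min_remaining=10):
    # Sleep until GitHub resets the quota when few requests remain
    remaining, _limit = client.rate_limiting
    if remaining < min_remaining:
        reset_ts = client.rate_limiting_resettime  # unix epoch seconds
        time.sleep(max(0, reset_ts - time.time()) + 1)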

Next, let's run some statistics over the follow relationships.

from operator import itemgetter
from collections import Counter

# Look at the updated graph statistics
print (nx.info(g),'\n')

# How many 'follows' edges were added
print (len([e for e in g.edges_iter(data=True) if e[2]['type'] == 'follows']),'\n')

# How many in-graph followers a particular stargazer has
print (len([e 
           for e in g.edges_iter(data=True) 
               if e[2]['type'] == 'follows' and e[1] == 'freeman-lab(user)']),'\n')


# The ten nodes with the highest degree
print (list(sorted([n for n in g.degree_iter()], key=itemgetter(1), reverse=True)[:10]),'\n')


# Count the in-graph followers of each stargazer
c = Counter([e[1] for e in g.edges_iter(data=True) if e[2]['type'] == 'follows'])
popular_users = [ (u, f) for (u, f) in c.most_common() if f > 0 ]
print ("Number of popular users", len(popular_users))
print ("Top popular users:", popular_users[:10])

Name:
Type: DiGraph
Number of nodes: 81
Number of edges: 95
Average in degree: 1.1728
Average out degree: 1.1728

15

4

[('findspark(repo)', 80), ('rgbkrk(user)', 6), ('freeman-lab(user)', 5), ('esafak(user)', 4), ('andrewiiird(user)', 4), ('minimaxir(user)', 3), ('nchammas(user)', 3), ('rholder(user)', 2), ('chrinide(user)', 2), ('dapurv5(user)', 2)]

Number of popular users 9
Top popular users: [('freeman-lab(user)', 4), ('rgbkrk(user)', 3), ('minimaxir(user)', 2), ('rholder(user)', 1), ('amontalenti(user)', 1), ('rdhyee(user)', 1), ('stared(user)', 1), ('esafak(user)', 1), ('nchammas(user)', 1)]

('freeman-lab(user)', 4) means that four of findspark's stargazers also follow freeman-lab. To mine the graph more clearly, let's look at it with the centrality measures introduced above. Because the repository node has a huge degree while every other node's degree is small by comparison, we first remove the repository node before computing.

from operator import itemgetter
h = g.copy()
# Remove the central repo node
h.remove_node('findspark(repo)')

dc = sorted(nx.degree_centrality(h).items(), 
            key=itemgetter(1), reverse=True)

print ("Degree Centrality")
print (dc[:10],'\n')
bc = sorted(nx.betweenness_centrality(h).items(), 
            key=itemgetter(1), reverse=True)

print ("Betweenness Centrality")
print (bc[:10],'\n')

print ("Closeness Centrality")
cc = sorted(nx.closeness_centrality(h).items(), 
            key=itemgetter(1), reverse=True)
print (cc[:10])

Degree Centrality
[('rgbkrk(user)', 0.06329113924050633),
('freeman-lab(user)', 0.05063291139240506),
('esafak(user)', 0.0379746835443038),
('andrewiiird(user)', 0.0379746835443038),
('minimaxir(user)', 0.02531645569620253),
('nchammas(user)', 0.02531645569620253),
('rholder(user)', 0.012658227848101266),
('chrinide(user)', 0.012658227848101266),
('dapurv5(user)', 0.012658227848101266),
('amontalenti(user)', 0.012658227848101266)]

Betweenness Centrality
[('rgbkrk(user)', 0.0009737098344693282),
('esafak(user)', 0.0006491398896462187),
('nchammas(user)', 0.0001622849724115547),
('linkTDP(user)', 0.0),
('nehalecky(user)', 0.0),
('charsmith(user)', 0.0),
('cwharland(user)', 0.0),
('d3borah(user)', 0.0),
('cimox(user)', 0.0),
('dirmeier(user)', 0.0)]

Closeness Centrality
[('andrewiiird(user)', 0.04050632911392405),
('esafak(user)', 0.03375527426160337),
('chrinide(user)', 0.028768699654775604),
('rgbkrk(user)', 0.02531645569620253),
('alexandercbooth(user)', 0.02278481012658228),
('dapurv5(user)', 0.012658227848101266),
('Bekterra(user)', 0.012658227848101266),
('nchammas(user)', 0.012658227848101266),
('he0x(user)', 0.012658227848101266),
('Hguimaraes(user)', 0.012658227848101266)]

Now, let's meet two gurus. One is rgbkrk, whose degree centrality and betweenness centrality are the highest in the data: Kyle Kelley, from Silicon Valley, is on the teams behind big projects like jupyter, ipython, and cloudpipe, and is very influential.
The other is andrewiiird, with the highest closeness centrality: Andrew is from Hong Kong, and his following-to-followers ratio is almost two orders of magnitude, whereas Kyle Kelley's is close to one to one. This shows just how "interactive" Andrew is (spending most of his time following others while staying hidden in the shadows 😜).

Finding active projects

At the moment the graph contains just one project; every other node is a user. To find active projects, we iterate over each user, fetch the projects they have starred, and add those to the graph, expanding its set of project nodes.

MAX_REPOS = 500
for i, sg in enumerate(stargazers):
    print (sg.login)
    try:
        for starred in sg.get_starred()[:MAX_REPOS]: # Slice to avoid supernodes
            g.add_node(starred.name + '(repo)', type='repo', lang=starred.language, \
                       owner=starred.owner.login)
            g.add_edge(sg.login + '(user)', starred.name + '(repo)', type='gazes')
    except Exception: #ssl.SSLError:
        print ("Encountered an error fetching starred repos for", sg.login, "Skipping.")

    print ("Processed", i+1, "stargazers' starred repos")
    print ("Num nodes/edges in graph", g.number_of_nodes(), "/", g.number_of_edges())
    print ("Rate limit", client.rate_limiting)

pchalasani
Processed 1 stargazers' starred repos
Num nodes/edges in graph 176 / 190
Rate limit (4996, 5000)
rgbkrk
Encountered an error fetching starred repos for rgbkrk Skipping.
Processed 2 stargazers' starred repos
Num nodes/edges in graph 266 / 280
Rate limit (4993, 5000)
napjon
Processed 79 stargazers' starred repos
Num nodes/edges in graph 12692 / 17579
Rate limit (4369, 5000)
luzhijun
Processed 80 stargazers' starred repos
Num nodes/edges in graph 12693 / 17581
Rate limit (4368, 5000)

The operations above take quite a while, and the node count jumps from under a hundred to over ten thousand, so feel free to lower MAX_REPOS and go brew a coffee 😁.
With all the project data in hand, the next step is to work out which projects are hot. We can also query which projects a given user has bookmarked and which languages they favor, and even pick out the users who star everything in sight.

print (nx.info(g),'\n')
# Collect all repo nodes in the graph
repos = [n for n in g.nodes_iter() if g.node[n]['type'] == 'repo']

# The ten most-starred repositories
print ("Popular repositories")
print (sorted([(n,d) 
              for (n,d) in g.in_degree_iter() 
                  if g.node[n]['type'] == 'repo'], \
             key=itemgetter(1), reverse=True)[:10])

print ("Repositories that ocanbascil has bookmarked")
print ([(n,g.node[n]['lang']) 
       for n in g['ocanbascil(user)'] 
           if g['ocanbascil(user)'][n]['type'] == 'gazes'])

# The languages a user is interested in
print ("Programming languages ocanbascil is interested in")
print (list(set([g.node[n]['lang'] 
                for n in g['ocanbascil(user)'] 
                    if g['ocanbascil(user)'][n]['type'] == 'gazes'])))

# Users who starred more than MAX_REPOS repositories (supernode candidates)
print ("Supernode candidates")
print (sorted([(n, len(g.out_edges(n))) 
              for n in g.nodes_iter() 
                  if g.node[n]['type'] == 'user' and len(g.out_edges(n)) > MAX_REPOS], \
             key=itemgetter(1), reverse=True))

Name:
Type: DiGraph
Number of nodes: 12693
Number of edges: 17581
Average in degree: 1.3851
Average out degree: 1.3851

Popular repositories
[('findspark(repo)', 80), ('spark(repo)', 27), ('tensorflow(repo)', 24), ('luigi(repo)', 21), ('ipython(repo)', 21), ('data-science-ipython-notebooks(repo)', 20), ('awesome-public-datasets(repo)', 20), ('spark-notebook(repo)', 19), ('docker(repo)', 18), ('Probabilistic-Programming-and-Bayesian-Methods-for-Hackers(repo)', 17)]

Repositories that ocanbascil has bookmarked
[(‘til(repo)’, ‘VimL’), (‘react-joyride(repo)’, ‘JavaScript’), (‘django(repo)’, ‘Python’), (‘peewee(repo)’, ‘Python’), (‘select2(repo)’, ‘JavaScript’), (‘requests(repo)’, ‘Python’), (‘benfords-law(repo)’, ‘JavaScript’), (‘daterangepicker(repo)’, ‘JavaScript’), (‘django-zappa(repo)’, ‘Python’), (‘layered(repo)’, ‘Python’), (‘coffeescript(repo)’, ‘CoffeeScript’), (‘awesome-jekyll(repo)’, None), (‘lektor(repo)’, ‘Python’), (‘aetycoon(repo)’, ‘Python’), (‘computer-science(repo)’, None), (‘PerformanceEngine(repo)’, ‘Python’), (‘pattern_classification(repo)’, ‘Jupyter Notebook’), (‘node-v0.x-archive(repo)’, None), (‘django-crispy-forms(repo)’, ‘Python’), (‘hug(repo)’, ‘Python’), (‘euler-coffeescript(repo)’, ‘CoffeeScript’), (‘pandas(repo)’, ‘Python’), (‘pachyderm(repo)’, ‘Go’), (‘eulerclash(repo)’, None), (‘Pipe(repo)’, ‘Python’), (‘data-science-ipython-notebooks(repo)’, ‘Python’), (‘TweetHit(repo)’, ‘Python’), (‘captionmash(repo)’, None), (‘django-tinymce(repo)’, ‘Python’), (‘dataset(repo)’, ‘Python’), (‘shepherd(repo)’, ‘CSS’), (‘data-science-from-scratch(repo)’, ‘Python’), (‘spark(repo)’, ‘Scala’), (‘data-science-blogs(repo)’, ‘Python’), (‘specter(repo)’, ‘Clojure’), (‘tensorflow(repo)’, ‘R’), (‘findspark(repo)’, ‘Python’), (‘localtunnel(repo)’, ‘JavaScript’)]

Programming languages ocanbascil is interested in
['CSS', 'VimL', 'Python', 'JavaScript', 'Clojure', None, 'CoffeeScript', 'R', 'Jupyter Notebook', 'Scala', 'Go']

Supernode candidates
[('he0x(user)', 501), ('esafak(user)', 501)]

The results show that roughly a third of the people who starred findspark also starred spark and tensorflow. TensorFlow is Google's second-generation machine-learning system, developed from their work on DistBelief; machine learning really is that hot. About a quarter starred Luigi, which, it turns out, exists to manage long-running streaming and batch pipelines: it chains many tasks together, automates them, and handles failures, which people running Spark jobs presumably need as well. The remaining entries are, as their names alone suggest, closely tied to Spark.
Beyond that, the graph can also answer which projects a given user follows and which language dominates each of those projects.

Finding hot languages

Now the question: how do we find the most popular language among all the projects in the graph? We could visit every project and drop its language into a Counter; that is not much work (see the sketch below). But suppose we want to know how many users program in a given language: then we would have to scan the in-degree users of every project and sum them up, which is far more expensive. A better idea is to expand the graph and give each language a node of its own.
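
For reference, a minimal sketch (not in the original post) of the naive Counter tally just described, run against the same graph g:

from collections import Counter

# Walk every repo node and tally its language directly
naive_counts = Counter(g.node[r]['lang']
                       for r in g.nodes_iter()
                       if g.node[r]['type'] == 'repo')
print(naive_counts.most_common(10))

The code below takes the graph-expansion route instead, adding one '(lang)' node per language, with user --(programs)--> language and language --(implements)--> repo edges: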

repos = [n 
         for n in g.nodes_iter() 
             if g.node[n]['type'] == 'repo']

for repo in repos:
    # Guard against None: some empty projects report their language as None
    lang = (g.node[repo]['lang'] or "") + "(lang)"
    # The users who starred this repo
    stargazers = [u 
                  for (u, r, d) in g.in_edges_iter(repo, data=True) 
                     if d['type'] == 'gazes'
                 ]
    for sg in stargazers:
        g.add_node(lang, type='lang')
        g.add_edge(sg, lang, type='programs')
        g.add_edge(lang, repo, type='implements')   

Next, let's look at the statistics on languages.

print (nx.info(g),'\n')
# Which languages appear in the graph
print ([n 
       for n in g.nodes_iter() 
           if g.node[n]['type'] == 'lang'])
# The languages a particular user programs in
print ([n 
       for n in g['ocanbascil(user)'] 
           if g['ocanbascil(user)'][n]['type'] == 'programs'],'\n')
# The most popular languages
print ("Most popular languages")
print (sorted([(n, g.in_degree(n))
 for n in g.nodes_iter() 
     if g.node[n]['type'] == 'lang'], key=itemgetter(1), reverse=True)[:10])
# How many people program in a given language
python_programmers = [u 
                      for (u, l) in g.in_edges_iter('Python(lang)') 
                          if g.node[u]['type'] == 'user']
print ("Number of Python programmers:", len(python_programmers))

javascript_programmers = [u for 
                          (u, l) in g.in_edges_iter('JavaScript(lang)') 
                              if g.node[u]['type'] == 'user']
print ("Number of JavaScript programmers:", len(javascript_programmers))

# People who use both languages
print ("Number of programmers who use JavaScript and Python")
print (len(set(python_programmers).intersection(set(javascript_programmers))))

# People who use JavaScript but not Python
print ("Number of programmers who use JavaScript but not Python")
print (len(set(javascript_programmers).difference(set(python_programmers))))

# XXX: Can you determine who is the most polyglot programmer?

Name:
Type: DiGraph
Number of nodes: 12807
Number of edges: 31824
Average in degree: 2.4849
Average out degree: 2.4849

[‘LiveScript(lang)’, ‘APL(lang)’, ‘Swift(lang)’, ‘Java(lang)’, ‘Nix(lang)’, ‘Erlang(lang)’, ‘Common Lisp(lang)’, ‘Python(lang)’, ‘GCC Machine Description(lang)’, ‘Prolog(lang)’, ‘PigLatin(lang)’, ‘Vala(lang)’, ‘SAS(lang)’, ‘FORTRAN(lang)’, ‘Cuda(lang)’, ‘Matlab(lang)’, ‘HCL(lang)’, ‘F#(lang)’, ‘Verilog(lang)’, ‘Emacs Lisp(lang)’, ‘Inno Setup(lang)’, ‘Vue(lang)’, ‘TeX(lang)’, ‘DTrace(lang)’, ‘Processing(lang)’, ‘Hack(lang)’, ‘Lua(lang)’, ‘Assembly(lang)’, ‘Parrot(lang)’, ‘Shell(lang)’, ‘Arduino(lang)’, ‘R(lang)’, ‘C#(lang)’, ‘Rust(lang)’, ‘Standard ML(lang)’, ‘Puppet(lang)’, ‘Gosu(lang)’, ‘C(lang)’, ‘PHP(lang)’, ‘Scala(lang)’, ‘Gnuplot(lang)’, ‘Crystal(lang)’, ‘Objective-J(lang)’, ‘COBOL(lang)’, ‘Cucumber(lang)’, ‘ApacheConf(lang)’, ‘Brainfuck(lang)’, ‘Kotlin(lang)’, ‘Pascal(lang)’, ‘Pike(lang)’, ‘RAML(lang)’, ‘Lean(lang)’, ‘Logos(lang)’, ‘Elixir(lang)’, ‘Dart(lang)’, ‘C++(lang)’, ‘CMake(lang)’, ‘Clojure(lang)’, ‘D(lang)’, ‘Batchfile(lang)’, ‘OpenSCAD(lang)’, ‘Coq(lang)’, ‘PowerShell(lang)’, ‘Visual Basic(lang)’, ‘VimL(lang)’, ‘NetLogo(lang)’, ‘OCaml(lang)’, ‘VHDL(lang)’, ‘Ruby(lang)’, ‘Go(lang)’, ‘Metal(lang)’, ‘Jupyter Notebook(lang)’, ‘KiCad(lang)’, ‘Smarty(lang)’, ‘Tcl(lang)’, ‘CoffeeScript(lang)’, ‘QML(lang)’, ‘(lang)’, ‘Scheme(lang)’, ‘HTML(lang)’, ‘Makefile(lang)’, ‘CartoCSS(lang)’, ‘DIGITAL Command Language(lang)’, ‘AppleScript(lang)’, ‘Perl6(lang)’, ‘Protocol Buffer(lang)’, ‘Julia(lang)’, ‘Awk(lang)’, ‘Elm(lang)’, ‘Haskell(lang)’, ‘TLA(lang)’, ‘Nimrod(lang)’, ‘JavaScript(lang)’, ‘Max(lang)’, ‘TypeScript(lang)’, ‘GAP(lang)’, ‘Scilab(lang)’, ‘Web Ontology Language(lang)’, ‘XSLT(lang)’, ‘Objective-C++(lang)’, ‘Perl(lang)’, ‘PLSQL(lang)’, ‘PureScript(lang)’, ‘PLpgSQL(lang)’, ‘Eagle(lang)’, ‘Objective-C(lang)’, ‘Racket(lang)’, ‘OpenEdge ABL(lang)’, ‘Groovy(lang)’, ‘Groff(lang)’, ‘Frege(lang)’, ‘NSIS(lang)’, ‘Mathematica(lang)’, ‘CSS(lang)’]

['JavaScript(lang)', 'R(lang)', 'Python(lang)', 'Jupyter Notebook(lang)', 'VimL(lang)', 'Scala(lang)', 'Clojure(lang)', 'CoffeeScript(lang)', 'Go(lang)', '(lang)', 'CSS(lang)']

Most popular languages
[('Python(lang)', 80), ('(lang)', 74), ('JavaScript(lang)', 74), ('Jupyter Notebook(lang)', 69), ('Java(lang)', 67), ('HTML(lang)', 67), ('C++(lang)', 66), ('Scala(lang)', 65), ('Shell(lang)', 62), ('C(lang)', 62)]

Number of Python programmers: 80
Number of JavaScript programmers: 74
Number of programmers who use JavaScript and Python
74
Number of programmers who use JavaScript but not Python
0

The results show that scripting languages such as JavaScript, R, and Python dominate GitHub's popularity chart. Only ten thousand or so projects went into this tally, but it still bears out the old saying that "GitHub belongs to the front end" 😁. Scala also ranks near the top, mainly because our seed project was findspark, and Go is hot here too. Where did Java go? Mine a Hadoop-related project and it would probably show up.
Another interesting observation: everyone here who uses JavaScript also uses Python, while not every Python user touches JavaScript (80 - 74 = 6 of them don't), which fits the current state of the IT industry. The gap is small in this sample; seeded with a Chinese project it would probably be much wider. And the many people who have never heard of GitHub can't be counted at all.
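
As for the XXX exercise left in the code above, one hedged way to answer it (not in the original post) is to rank users by how many distinct language nodes their 'programs' edges reach:

from operator import itemgetter

# A user's polyglot score: the number of distinct languages they program in
polyglots = sorted(((u, sum(1 for n in g[u] if g[u][n]['type'] == 'programs'))
                    for u in g.nodes_iter() if g.node[u]['type'] == 'user'),
                   key=itemgetter(1), reverse=True)
print(polyglots[:5])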

The data above is not particularly large. If you are interested, visit GitHub Archive, which offers massive, global-scale GitHub data, and explore it with some of the recommended "big data" tools.

Visualization

Thanks to NetworkX's export capabilities, the graph can be serialized as JSON and consumed by D3, the JavaScript toolkit, though there are many other packages worth considering for graph visualization. Graphviz is the highly configurable classic, able to render very complex graphs into bitmap images; it has traditionally been driven from the terminal, but user interfaces now exist for most platforms. Another popular open-source project is Gephi, which shines at interactive exploration and has grown rapidly in popularity over the past few years; highly recommended. Our graph, however, has tens of thousands of project nodes; rendered whole it is a pitch-black, unreadable, laggy blob, so below we visualize only the user and language nodes.

import json
from IPython.display import IFrame
from IPython.core.display import display 
from networkx.readwrite import json_graph

print ("Stats on the full graph")
print (nx.info(g))
# Extract only the user and language nodes
mtsw_users = [n for n in g if g.node[n]['type'] == 'user'] + \
             [n for n in g if g.node[n]['type'] == 'lang']
h = g.subgraph(mtsw_users)
print ("Stats on the extracted subgraph")
print (nx.info(h))
# Export to JSON for D3
d = json_graph.node_link_data(h)
json.dump(d, open('force.json', 'w'))
viz_file = 'files/force.html'
# Display the D3 force-directed visualization
display(IFrame(viz_file, '100%', '900px'))
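
A quick peek (not in the original post) at the node-link structure D3 consumes:

# node_link_data produces a dict with 'nodes' and 'links' lists, the shape
# that D3 force layouts expect
print(list(d.keys()))  # e.g. ['directed', 'multigraph', 'graph', 'nodes', 'links']
print(d['nodes'][0])   # one node record, e.g. {'type': 'user', 'id': '...(user)'}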

If the dynamic page below fails to load, click through to the interactive chart directly.

To analyze how languages relate to one another, you can also take the several languages found in a single project, say a, b, and c, build an undirected graph, and give the edges a-b, a-c, and b-c weight 1; when a and b co-occur in yet another project, the a-b weight becomes 2, and so on. For a polished version of this idea, see mjwillson's ProgLangVisualise (linked in the references below).
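
A minimal sketch of that co-occurrence scheme (not in the original post; lang_sets is hypothetical input, the set of languages per repository):

from itertools import combinations
import networkx as nx

# Bump an undirected edge's weight each time two languages share a repository
lang_sets = [{'a', 'b', 'c'}, {'a', 'b'}]  # hypothetical: languages per repo
cg = nx.Graph()
for langs in lang_sets:
    for u, v in combinations(sorted(langs), 2):
        if cg.has_edge(u, v):
            cg[u][v]['weight'] += 1
        else:
            cg.add_edge(u, v, weight=1)
print(cg.edges(data=True))
# e.g. [('a', 'b', {'weight': 2}), ('a', 'c', {'weight': 1}), ('b', 'c', {'weight': 1})]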

References

  • https://en.wikipedia.org/wiki/Centrality
  • https://github.com/mjwillson/ProgLangVisualise
  • https://developer.github.com/v3/
  • http://networkx.github.io
  • https://github.com/PyGithub/PyGithub
  • Russell, Matthew A. Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More. O'Reilly Media, Inc., 2013.
  • Csardi, Gabor. "Eigenvalues and eigenvectors of the adjacency matrix of a graph."
  • Barthélemy, M. "Betweenness centrality in large complex networks." Physics 38.2 (2004): 163-168.