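I assumed a ready-made dictionary of province, city, district, county, town, and village names would be available for download, but it turned out to be more complicated than expected, so I had to extract one myself. I downloaded an exported SQL file from the web and imported it into a local MySQL database. The plan from there: run a distinct query on each relevant column, then write the results line by line to a dictionary file. All of it is done in Python.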

Installing the PyMySQL package

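I won't go over setting up a Python development environment. I use PyCharm and had already worked out how to use it, so installing a dependency package was straightforward.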
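Outside PyCharm, the package can normally be installed from a terminal with `pip install PyMySQL`. A quick way to confirm that the interpreter can see it is a check along these lines; `is_installed` is a helper name made up for this sketch, not part of PyMySQL.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # The import name is lowercase even though the package is called PyMySQL.
    print("pymysql available:", is_installed("pymysql"))
```

If this prints `False` inside PyCharm, the package was probably installed into a different interpreter than the one the project uses.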

One pitfall: garbled Chinese text

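Garbled Chinese text is most likely a character set problem. Setting the character set of the database, its tables, and the client connection to utf8 fixes it.
See also: Fixing garbled Chinese text in MySQL queries from Python
Show the database character sets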

show variables like 'character%'

Set the character set

set character_set_server = utf8
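The same check can be scripted. The sketch below assumes the rows come back as `(variable_name, value)` tuples, which is the shape `cursor.fetchall()` gives for `show variables like 'character%'`. The function name `non_utf8_variables` and the sample rows are illustrative only and are not taken from my server.

```python
def non_utf8_variables(rows):
    """Pick out character-set variables whose value is not a utf8 variant.

    rows: (variable_name, value) tuples, the shape cursor.fetchall() returns
    for "show variables like 'character%'".
    """
    # These two do not describe the encoding of query data, so skip them:
    # character_sets_dir is a path, character_set_filesystem is usually binary.
    ignored = {"character_sets_dir", "character_set_filesystem"}
    return {
        name: value
        for name, value in rows
        if name not in ignored and not value.lower().startswith("utf8")
    }


# Illustrative rows only (assumed values, not real server output).
sample_rows = [
    ("character_set_client", "utf8"),
    ("character_set_connection", "utf8"),
    ("character_set_database", "latin1"),
    ("character_set_filesystem", "binary"),
    ("character_set_results", "utf8"),
    ("character_set_server", "latin1"),
    ("character_set_system", "utf8"),
    ("character_sets_dir", "/usr/share/mysql/charsets/"),
]

print(non_utf8_variables(sample_rows))
```

Any variable that shows up in the result is a candidate for the `set` statement above.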

Part of the code that exports the dictionary:

import pymysql


class AnManMysql:
    phost = '127.0.0.1'
    pyuser = 'root'
    password = '123456'
    database = 'anman_org'

    @staticmethod
    def pyMyConnection(phost, pyuser, password, database):
        # Recent PyMySQL releases expect keyword arguments here.
        db = pymysql.connect(host=phost, user=pyuser, password=password,
                             database=database, charset="utf8")
        cursor = db.cursor()
        return cursor

    # Export the province/city/county/town/village dictionary
    @staticmethod
    def getDict(cursor):
        queries = [
            "select distinct province_name from j_position",
            "select distinct city_name from j_position",
            "select distinct county_name from j_position",
            "select distinct town_name from j_position",
            # distinct applies to the (name, id) pair, so a village name
            # shared by several ids is written once per id
            "select distinct village_name,village_id from j_position",
        ]
        # Open the dictionary file my.dict; utf-8 keeps the Chinese names intact
        with open("my.dict", "w", encoding="utf-8") as fo:
            for query in queries:
                cursor.execute(query)
                data = cursor.fetchall()
                for da in data:
                    print(da[0])
                    fo.write(da[0] + "\n")
                # Number of rows returned for this level
                print(len(data))


cursor = AnManMysql.pyMyConnection(AnManMysql.phost, AnManMysql.pyuser, AnManMysql.password, AnManMysql.database)
AnManMysql.getDict(cursor)
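The export logic can be tried without a MySQL server. The sketch below is not the code I ran: it uses an in-memory SQLite table as a stand-in, with only two of the columns and a few made-up rows, and `export_columns` is a hypothetical helper name. It only illustrates the "distinct per column, one name per line" idea.

```python
import os
import sqlite3
import tempfile


def export_columns(cursor, table, columns, path):
    """Write the distinct values of each column to path, one per line."""
    total = 0
    with open(path, "w", encoding="utf-8") as fo:
        for column in columns:
            # Identifiers cannot be bound as query parameters, so they are
            # interpolated; only pass trusted, hard-coded names here.
            cursor.execute(f"select distinct {column} from {table}")
            for (name,) in cursor.fetchall():
                if name:  # skip NULL or empty names
                    fo.write(name + "\n")
                    total += 1
    return total


# Stand-in data: an in-memory SQLite table with made-up rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table j_position (province_name text, city_name text)")
cur.executemany(
    "insert into j_position values (?, ?)",
    [("广东省", "广州市"), ("广东省", "深圳市"), ("浙江省", "杭州市")],
)

out_path = os.path.join(tempfile.mkdtemp(), "my.dict")
written = export_columns(cur, "j_position", ["province_name", "city_name"], out_path)
print(written)
conn.close()
```

Because the helper only relies on `execute` and `fetchall`, a PyMySQL cursor should work in place of the SQLite one.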

The resulting dictionary file is about 10 MB, 703,205 lines in total.
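Since each level is queried separately, and the village query is distinct over (name, id) pairs rather than over names alone, the same name can end up on more than one line. If unique entries are wanted, a pass like the following could drop the repeats while keeping the original order. This is a sketch; `dedupe_lines`, `dedupe_file`, and the output file name are made up for illustration.

```python
def dedupe_lines(lines):
    """Return the entries with later repeats removed, keeping first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            unique.append(word)
    return unique


def dedupe_file(src_path, dst_path):
    """Write the unique entries of src_path to dst_path; return both line counts."""
    with open(src_path, encoding="utf-8") as src:
        lines = src.readlines()
    unique = dedupe_lines(lines)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(unique) + ("\n" if unique else ""))
    return len(lines), len(unique)
```

A call such as `dedupe_file("my.dict", "my.unique.dict")` returns the line counts before and after, which shows how many repeats the file contained.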

Tags: python
