CyannyLive

AI and Big Data

Ways of Reading, and the Books I Read in 2021

I recently listened to a talk by Luo Zhenyu that contrasted two modes of learning, taking the elevator versus rock climbing. I found the image apt, and my own study methods fall into two patterns like these:
First, the elevator method: pick a few books and work through them slowly, bit by bit. As the saying goes, "a nine-story tower rises from a heap of earth"; there is no need to overthink it, and at some point you find yourself at the top. The limitation of this method is that it adds knowledge but exercises thinking too little, unless you spend the time writing summaries, and when I am lazy I only skim for the gist. A few books I have been reading recently:

Dynamic Programming Summary

1. The Way of Dynamic Programming

  • Characteristics of DP problems
    • Optimal substructure: the original problem is an optimization problem that can be recursively decomposed into subproblems; combining the subproblems' optimal solutions mathematically yields the optimal solution to the original problem
    • Overlapping subproblems: the subproblems recur and overlap, as in computing the Fibonacci sequence (see the sketch after this list); if the subproblems do not overlap, plain recursion suffices
    • No aftereffect: once a state is determined, it is unaffected by decisions made after that state, e.g. the Dungeon Game problem
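
As a concrete illustration of overlapping subproblems, here is a minimal Python sketch using the Fibonacci example from the list above; memoization stores each subproblem's answer so it is computed only once:

    from functools import lru_cache

    # Naive recursion recomputes fib(k) for the same k exponentially
    # often; caching each result turns the O(2^n) call tree into O(n) work.
    @lru_cache(maxsize=None)
    def fib(n: int) -> int:
        if n < 2:
            return n
        return fib(n - 1) + fib(n - 2)

    print(fib(50))  # 12586269025, instant instead of intractable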

An Introduction to Bayesian Networks

A fine, clear winter day. This afternoon I finished the paper and now have a systematic understanding of what a Bayesian Network is. The paper is the one referenced in the causalnex tool:
Stephenson, Todd Andrew. An introduction to Bayesian network theory and usage. No. REP_WORK. IDIAP, 2000.

The paper mainly covers the following points:

  • What is a Bayesian network
  • Inference in Bayesian networks: the junction tree algorithm
  • Learning Bayesian networks
  • Applications
    • Automatic Speech Recognition: Dynamic Bayesian Network
    • Computer troubleshooting
    • Medical diagnosis
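
To make the idea concrete, below is a minimal Python sketch of a toy three-node network (Rain, Sprinkler, WetGrass, with made-up probabilities) and exact inference by brute-force enumeration; note that the paper's inference method is the junction tree algorithm, which this sketch does not implement:

    from itertools import product

    # Toy CPTs: Rain and Sprinkler are root nodes; WetGrass depends on both.
    p_rain = {True: 0.2, False: 0.8}
    p_sprinkler = {True: 0.1, False: 0.9}
    p_wet = {  # P(WetGrass=True | Sprinkler, Rain)
        (True, True): 0.99, (True, False): 0.90,
        (False, True): 0.80, (False, False): 0.0,
    }

    def joint(r, s, w):
        """Bayesian network factorization: P(R,S,W) = P(R) P(S) P(W|S,R)."""
        pw = p_wet[(s, r)] if w else 1.0 - p_wet[(s, r)]
        return p_rain[r] * p_sprinkler[s] * pw

    # P(Rain=True | WetGrass=True): sum out Sprinkler, then normalize.
    num = sum(joint(True, s, True) for s in (True, False))
    den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    print(num / den)  # ~0.69: wet grass raises our belief that it rained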

Structure Learning Algorithm NOTEARS

Over the past three years I have thrown myself into AI and learned many algorithms; recently I have started to look at AI from a higher vantage point. AI is not only Machine Learning: there are also state-based models, variable-based models, and logic-programming approaches. In the past six months I read The Book of Why, which influenced me deeply and noticeably changed how I look at the world. I now believe causal inference is a field well worth studying; even if it has few production use cases today, I am confident it will matter greatly in the future.

Today I sat down and took a careful look at NOTEARS, the structure learning algorithm used in the CausalNex library. The paper appeared at NIPS 2018; the method is striking and the solution elegantly simple. Below are some of my notes:

Paper: Zheng, Xun, et al. “DAGs with NO TEARS: Continuous optimization for structure learning.” Advances in Neural Information Processing Systems 31 (2018): 9472-9483.
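
The paper's central trick is replacing the combinatorial acyclicity constraint with a smooth function of the weighted adjacency matrix W: h(W) = tr(exp(W ∘ W)) - d, which is zero exactly when W encodes a DAG (∘ is the elementwise product). Here is a minimal Python sketch of just this constraint, not the full optimization loop:

    import numpy as np
    from scipy.linalg import expm

    def notears_acyclicity(W: np.ndarray) -> float:
        """h(W) = tr(exp(W * W)) - d, with * the elementwise product.
        h(W) == 0 if and only if the weighted adjacency matrix W is a DAG."""
        d = W.shape[0]
        return float(np.trace(expm(W * W)) - d)

    W_cycle = np.array([[0.0, 1.0], [1.0, 0.0]])  # 0 -> 1 -> 0: a cycle
    W_dag   = np.array([[0.0, 1.0], [0.0, 0.0]])  # 0 -> 1 only: a DAG
    print(notears_acyclicity(W_cycle))  # ~1.09 > 0: cycle detected
    print(notears_acyclicity(W_dag))    # ~0.0: acyclic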

Akka HTTP Notes

In nearly three years of programming on Scala projects, Akka is one of the highest-quality Scala libraries I have seen. Its core abstraction is an Actor-based programming model, and on top of that core it provides a set of tool libraries: you only need to write your business logic in Actor form, and the framework handles the underlying message passing, high concurrency, and IO for you. Akka is very practical in industrial settings. For example, when you have many microservices with varying performance characteristics and need to orchestrate them for business such as ad serving, online recommendation, or incident detection, Akka's abstractions become very useful.

Recently I studied Akka-HTTP systematically. I particularly like this library's use of meta-programming: akka-http implements a well-worn kind of HTTP library very elegantly, with design and abstractions that reward close reading. With limited time I only spent a week on it; below are the notes that have helped me most recently, and I will keep refining them when I have time.

1. Akka HTTP Advantages

Positioning: a library for handling complex business logic, not an MVC framework (such as Play)

  • DSL with convenient pathMatchers
  • Streaming: streaming data transfer and rate limiting
  • Easy interaction with actors

Java Performance Toolbox

This weekend I read chapter 3 of Java Performance: The Definitive Guide; here is a brief summary of the Java performance toolbox.

System Monitoring Tools

1. CPU Usage

vmstat: reports virtual memory statistics, covering processes, memory, paging, block IO, traps, disks, and CPU activity.
vmstat [options] [delay [count]]
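
For example, the following prints three reports at five-second intervals (the first report shows averages since boot):

vmstat 5 3

In the output, the us, sy, and id columns split CPU time into user, system, and idle percentages.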

Big Data and ML Learning

As the days at work go by, I cannot help thinking about plans for the future. Work is mostly about business and delivering results, with little time left for study, so self-improvement takes far more self-motivation than it did in school. I have always worked in big data; even though there is more business work now, my direction has not changed, and I have gained plenty of hands-on machine learning practice. Below are the books and topics I most want to study:

Awesome Books for 2018

One of my 2018 resolutions is to read more books. Here I list some great books in my plan.

Machine Learning

  • Machine Learning: A Probabilistic Perspective
  • Deep Learning (Ian Goodfellow)
  • Pattern Recognition and Machine Learning (Christopher M. Bishop)
  • The Elements of Statistical Learning
  • Hands-On Machine Learning with Scikit-Learn and TensorFlow (in progress now)
  • Python Machine Learning
  • 数学之美
  • 统计学 (review)
  • 统计学习方法
  • 机器学习

Ppmml Published Today

On the last day before the New Year holiday, ppmml was published.
ppmml is a Python library for converting machine learning models to PMML files; it wraps the jpmml libraries and provides a clean interface.

What is a PMML file?

PMML, the "Predictive Model Markup Language", is a standard for XML documents that express trained instances of analytic models.
Many platforms adopt PMML as a machine learning model standard, including IBM, SAS, Microsoft, Spark, KNIME, etc. (see the pmml-platforms list).

jpmml provides PMML conversion libraries supporting Spark, XGBoost, TensorFlow, sklearn, LightGBM, and R models. These libraries are separate from one another and written in Java.
ppmml wraps the jpmml libraries and provides a simple, easy-to-use API for generating PMML files.
Version 0.0.1 supports sklearn, TensorFlow, Spark, LightGBM, XGBoost, and R models; all models supported by jpmml are supported by ppmml. Common machine learning algorithms are covered, such as Decision Tree, Logistic Regression, GBDT, Random Forest, and KMeans. However, Deep Learning support is not ready yet.
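
A hypothetical usage sketch in Python; the entry point and argument names below (to_pmml, model_input, pmml_output, model_type) are assumptions from memory, so check the ppmml README for the actual interface:

    import ppmml

    # Convert a saved sklearn model to a PMML file; the signature shown
    # here is an assumption, not a confirmed API.
    ppmml.to_pmml(model_input="./lr.model",  # trained model file (assumed)
                  pmml_output="./lr.pmml",   # destination PMML file
                  model_type="sklearn")      # which jpmml backend to use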

How to Use Scala UDF and UDAF in PySpark

Spark's DataFrame API provides efficient, easy-to-use operations for analyzing distributed collections of data. Many users love the PySpark API, which is more approachable than the Scala API, but UDFs written in Python can become a performance problem. How about implementing those UDFs in Scala and calling them from PySpark? Besides, in Spark 2.0 a UDAF can only be defined in Scala, so how do we use one from PySpark? Let's have a try; a sketch follows.
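
Here is a hedged Python sketch of the calling side, assuming Spark 2.3+, where PySpark exposes registerJavaFunction and registerJavaUDAF. The Scala classes com.example.StrLen (implementing org.apache.spark.sql.api.java.UDF1) and com.example.MyAverage (a UserDefinedAggregateFunction) are hypothetical names, and their jar must be on the classpath (e.g. via --jars):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.appName("scala-udf-from-pyspark").getOrCreate()

    # Register a Scala/Java UDF class by its fully qualified name
    # (hypothetical class); it is then usable in SQL and selectExpr.
    spark.udf.registerJavaFunction("strLen", "com.example.StrLen", IntegerType())

    # Register a Scala UDAF class by name (Spark 2.3+, hypothetical class).
    spark.udf.registerJavaUDAF("myAvg", "com.example.MyAverage")

    df = spark.createDataFrame([("hello",), ("spark",)], ["word"])
    df.selectExpr("strLen(word)").show()

On older Spark versions the same registration is done through sqlContext.registerJavaFunction, and the UDAF must be invoked via SQL rather than the DataFrame DSL.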

Copyright
© 2022 Cyanny Liang